Just to follow up on the discussion.
I pasted the logs in the wrong order. There's a lease expired entry followed by
a process server shutdown log entry.
It appears that the region assignment is caused by
ProcessServerShutdown.scanMetaRegion()
// Get regions reassigned for (HRegionInfo info: regions) {
master.regionManager.setUnassigned(info); }
And from here they get reassigned by calling assignRegionsToMultipleServers in
RegionManager.assignRegions
If this is correct, than the corruption (in my case) happens when the region
server shuts down. However, both setUnnasigned() method and other methods that
change the state of RegionManager.unnassignedRegions are called from many
places that could be potential points where this scenario could happen.
Can we talk with certainty about a clear bug now?
Thanks,
Cosmin
P.S. All 5 machines were under heavy load for more than 14 hours (30-100%
waiting for disk). I inserted more than 300GB of data during a 6 hour period,
however there were only 60-70 regions at the end of the six hours. After that I
restarted the cluster and then it took forever to compact the data (there were
many compactions that had between 50 and 100 files to compact). This is (I
guess) another bug that we hit.
On 9/10/08 2:06 AM, "Cosmin Lehene" <[EMAIL PROTECTED]> wrote:
On 9/9/08 11:19 PM, "Jim Kellerman" <[EMAIL PROTECTED]> wrote:
> I think there is a problem, but maybe not the one we surmised. Are
> there any lease timeout reports in your master log?
Like this?
2008-09-08 10:43:01,640 INFO org.apache.hadoop.hbase.master.ServerManager:
10.131.237.54:60020 lease expired
>
> When a lease is timed out, all the regions being served by the region
> server get reassigned almost immediately, which could be the cause of
> a region being assigned to another server while the original server
> still thought it had exclusive access to the region.
The region A is users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
Here the master decides that the region server is overloaded
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Server 10.131.237.54:60020 is overloaded. Server load: 52 avg: 48.0
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Choosing to reassign 4 regions. mostLoadedRegions has 10 regions in it.
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region
users,2FF6F81E-1197-4BEB-97D2-E0AFEB529B22,1220891020447
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region
users,24F026E2-DDE1-469F-8A1A-0638759E6A2A,1220887834020
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region
users,F167933A-BCBE-4202-B453-0EA1A629F8D5,1220888091404
Then the split finishes
2008-09-08 10:01:34,385 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_SPLIT:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [EMAIL PROTECTED] from
10.131.237.54:60020
Then many..
2008-09-08 10:43:13,541 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner REGION => {NAME =>
'users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108', STARTKEY =>
'15D050EC-CC9D-4683-B66E-17D58B2FD152', ENDKEY =>
'16D3F02B-AEAE-4273-A44D-2C0A
A6A0A62C', ENCODED => 1126800163, OFFLINE => true, SPLIT => true, TABLE =>
{{NAME => 'users', IS_ROOT => 'false', IS_META => 'false', FAMILIES =>
[{NAME => 'video', BLOOMFILTER => 'false', VERSIONS => '3', COMPRESSION =>
'NONE', LENGTH => '2147483647', TTL => '-1', IN_M
EMORY => 'false', BLOCKCACHE => 'false'}]}}}, SERVER =>
'10.131.237.54:60020', STARTCODE => 1220892350279
Then the server shuts down
2008-09-08 10:44:06,552 DEBUG
org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanRootRegion: process
server shutdown scanning root region on 10.131.236.91
...
Then the lease expires
2008-09-08 10:43:01,640 INFO org.apache.hadoop.hbase.master.ServerManager:
10.131.237.54:60020 lease expired
And then it assigns the region ..
2008-09-08 10:44:12,656 INFO org.apache.hadoop.hbase.master.RegionManager:
assigning region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 to
server 10.131.237.51:60020
Then...
2008-09-08 10:46:41,493 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_PROCESS_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 from
10.131.237.51:60020
.................................
2008-09-08 10:47:23,753 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 from
10.131.237.51:60020
2008-09-08 10:47:23,753 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_OPEN:
users,8C4A27C9-B808-406A-A722-F1BE374F89EF,1220857848626 from
10.131.237.51:60020
2008-09-08 10:47:23,753 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Server 10.131.237.51:60020 is overloaded. Server load: 92 avg: 82.0
2008-09-08 10:47:23,753 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Choosing to reassign 10 regions. mostLoadedRegions has 10 regions in it.
This is interesting (notice that everything above and beneath is in the same
second):
The region is open but
2008-09-08 10:47:23,753 INFO org.apache.hadoop.hbase.master.RegionManager:
Skipping region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
because it is already closing.
...
2008-09-08 10:48:12,140 INFO org.apache.hadoop.hbase.master.RegionManager:
Skipping region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
because it is already closing.
The region had been actually opened on the machine but
2008-09-08 10:48:13,374 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner REGION => {NAME =>
'users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108', STARTKEY =>
'15D050EC-CC9D-4683-B66E-17D58B2FD152', ENDKEY =>
'16D3F02B-AEAE-4273-A44D-2C0A
A6A0A62C', ENCODED => 1126800163, OFFLINE => true, SPLIT => true, TABLE =>
{{NAME => 'users', IS_ROOT => 'false', IS_META => 'false', FAMILIES =>
[{NAME => 'video', BLOOMFILTER => 'false', VERSIONS => '3', COMPRESSION =>
'NONE', LENGTH => '2147483647', TTL => '-1', IN_M
EMORY => 'false', BLOCKCACHE => 'false'}]}}}, SERVER =>
'10.131.237.51:60020', STARTCODE => 1220882303813
2008-09-08 11:03:19,849 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
users,1652503D-DF0F-464A-B6AD-C6DA4707AE77,1220893282341 no longer has
references to users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
And then the second split
2008-09-08 11:07:07,906 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_SPLIT:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [EMAIL PROTECTED] from
10.131.237.51:60020
And then after a cluster shutdown
2008-09-08 11:24:00,379 DEBUG
org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanRootRegion: process
server shutdown scanning root region on 10.131.236.91
The the lease expires
2008-09-08 11:22:49,504 INFO org.apache.hadoop.hbase.master.ServerManager:
10.131.237.51:60020 lease expired
It assigns it again..
2008-09-08 11:24:04,994 INFO org.apache.hadoop.hbase.master.RegionManager:
assigning region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 to
server 10.131.237.53:60020
2008-09-08 11:26:45,307 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 from
10.131.237.53:60020
2008-09-08 11:40:14,452 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220897189382 no longer has
references to users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:40:14,455 INFO org.apache.hadoop.hbase.master.BaseScanner:
Deleting region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
because daughter splits no longer hold references
Log grep on 10.131.237.54 aka R1
2008-09-08 09:45:57,306 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:46:00,847 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:46:00,848 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Opening region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
2008-09-08 09:46:01,394 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Next sequence id for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 is 201561483
2008-09-08 09:46:01,449 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
available
2008-09-08 09:46:01,449 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
requested for region:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:50:41,743 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [EMAIL PROTECTED]
2008-09-08 09:56:34,587 INFO org.apache.hadoop.hbase.regionserver.HRegion:
starting compaction on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:21,868 INFO org.apache.hadoop.hbase.regionserver.HRegion:
compaction completed on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 in 4mins, 47sec
2008-09-08 10:01:21,938 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Starting split of region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,342 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Compactions and cache flushes disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,343 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Scanners disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,415 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more active scanners for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,416 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Updates disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,416 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more row locks outstanding on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,416 INFO org.apache.hadoop.hbase.regionserver.HRegion:
closed users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:31,322 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split, META
updated, and report to master all successful. Old region=REGION => {NAME =>
'users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108', STARTKEY =>
'15D050EC-CC9D-4683-B66E-17D58B2FD152', ENDKEY =>
'16D3F02B-AEAE-4273-A44D-2C0AA6A0A62C', ENCODED => 1126800163, OFFLINE =>
true, SPLIT => true, TABLE => {{NAME => 'users', IS_ROOT => 'false', IS_META
=> 'false', FAMILIES => [{NAME => 'video', BLOOMFILTER => 'false',
COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL => '-1',
IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}, new regions:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220893282341,
users,1652503D-DF0F-464A-B6AD-C6DA4707AE77,1220893282341. Split took 9sec
2008-09-08 10:11:58,606 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [EMAIL PROTECTED]
The grep of the region log on 10.131.237.51 aka R2
2008-09-08 09:22:37,183 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:22:37,196 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:22:37,196 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Opening region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
2008-09-08 09:22:37,468 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Next sequence id for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 is 201561483
2008-09-08 09:22:37,469 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
available
2008-09-08 09:22:37,469 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
requested for region:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,447 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [EMAIL PROTECTED]
2008-09-08 09:39:11,448 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [EMAIL PROTECTED]
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Compactions and cache flushes disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Scanners disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more active scanners for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Updates disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more row locks outstanding on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 INFO org.apache.hadoop.hbase.regionserver.HRegion:
closed users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:44:12,689 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:47:21,796 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:47:21,797 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Opening region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
2008-09-08 10:47:22,308 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Next sequence id for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 is 201561483
2008-09-08 10:47:22,309 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
available
2008-09-08 10:47:22,309 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
requested for region:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:23,207 INFO org.apache.hadoop.hbase.regionserver.HRegion:
starting compaction on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:24,305 INFO org.apache.hadoop.hbase.regionserver.HRegion:
compaction completed on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 in 1sec
2008-09-08 11:06:24,385 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Starting split of region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,601 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Compactions and cache flushes disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Scanners disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more active scanners for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Updates disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more row locks outstanding on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 INFO org.apache.hadoop.hbase.regionserver.HRegion:
closed users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:07:06,367 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split, META
updated, and report to master all successful. Old region=REGION => {NAME =>
'users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108', STARTKEY =>
'15D050EC-CC9D-4683-B66E-17D58B2FD152', ENDKEY =>
'16D3F02B-AEAE-4273-A44D-2C0AA6A0A62C', ENCODED => 1126800163, OFFLINE =>
true, SPLIT => true, TABLE => {{NAME => 'users', IS_ROOT => 'false', IS_META
=> 'false', FAMILIES => [{NAME => 'video', BLOOMFILTER => 'false',
COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL => '-1',
IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}, new regions:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220897189382,
users,1652503D-DF0F-464A-B6AD-C6DA4707AE77,1220897189382. Split took 41sec
I understand now how unassignedRegions gets updated. However on a
leaseExpired situation the only method to modify it is unassignRootRegion
and this should only happen if the server with the expired lease holds the
root region.
Thanks,
Cosmin
>
> There are a couple of problems going on here if you see lease
> timeouts. One is that the region server is not sending its heartbeat
> soon enough (See HBASE-616 "We slept XXXXXX ms, ten times longer than
> scheduled: 3000" happens frequently."). This can happen on both the
> master and the region server meaning either the master does not
> process the heartbeat until after the lease times out, or the region
> server does not send the heartbeat until after the lease times out.
>
> This can be caused by thread starvation due to either too many
> runnable threads on a machine, not enough CPUs to handle the thread
> load, or just a bad thread scheduler.
>
> Hopefully, lease timeout will work better after HBase is integrated
> with Zookeeper.
>
> =====
>
> In the normal case, the master will not reassign a region due to load
> balancing until the region server reports that it has closed the
> region:
>
> Nothing happens with the mostLoadedRegions until it gets to
> RegionManager.unassignSomeRegions which is called by
> RegionManager.assignRegions
>
> In unassignSomeRegions, any regions selected are put into the
> closingRegions Set and a MSG_REGION_CLOSE will get sent to the region
> server. Candidates for assignment are only taken from the
> unassignedRegions Map.
>
> Not until the master receives a MSG_REPORT_CLOSE does any further
> action take place on that region.
>
> First the region is removed from the closingRegions Set.
>
> If the region was being split, the HRegionInfo received by the master
> will indicate that that the region is offline and split. In this case,
> it does not get reassigned.
>
> Otherwise it is added to the unassignedRegions Map and is now a
> candidate for reassignment.
>