On 9/9/08 11:19 PM, "Jim Kellerman" <[EMAIL PROTECTED]> wrote:
> I think there is a problem, but maybe not the one we surmised. Are
> there any lease timeout reports in your master log?
Like this?
2008-09-08 10:43:01,640 INFO org.apache.hadoop.hbase.master.ServerManager:
10.131.237.54:60020 lease expired
>
> When a lease is timed out, all the regions being served by the region
> server get reassigned almost immediately, which could be the cause of
> a region being assigned to another server while the original server
> still thought it had exclusive access to the region.
The region A is users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
Here the master decides that the region server is overloaded
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Server 10.131.237.54:60020 is overloaded. Server load: 52 avg: 48.0
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Choosing to reassign 4 regions. mostLoadedRegions has 10 regions in it.
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region
users,2FF6F81E-1197-4BEB-97D2-E0AFEB529B22,1220891020447
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region
users,24F026E2-DDE1-469F-8A1A-0638759E6A2A,1220887834020
2008-09-08 09:50:41,739 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Going to close region
users,F167933A-BCBE-4202-B453-0EA1A629F8D5,1220888091404
Then the split finishes
2008-09-08 10:01:34,385 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_SPLIT:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [EMAIL PROTECTED] from
10.131.237.54:60020
Then many..
2008-09-08 10:43:13,541 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner REGION => {NAME =>
'users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108', STARTKEY =>
'15D050EC-CC9D-4683-B66E-17D58B2FD152', ENDKEY =>
'16D3F02B-AEAE-4273-A44D-2C0A
A6A0A62C', ENCODED => 1126800163, OFFLINE => true, SPLIT => true, TABLE =>
{{NAME => 'users', IS_ROOT => 'false', IS_META => 'false', FAMILIES =>
[{NAME => 'video', BLOOMFILTER => 'false', VERSIONS => '3', COMPRESSION =>
'NONE', LENGTH => '2147483647', TTL => '-1', IN_M
EMORY => 'false', BLOCKCACHE => 'false'}]}}}, SERVER =>
'10.131.237.54:60020', STARTCODE => 1220892350279
Then the server shuts down
2008-09-08 10:44:06,552 DEBUG
org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanRootRegion: process
server shutdown scanning root region on 10.131.236.91
...
Then the lease expires
2008-09-08 10:43:01,640 INFO org.apache.hadoop.hbase.master.ServerManager:
10.131.237.54:60020 lease expired
And then it assigns the region ..
2008-09-08 10:44:12,656 INFO org.apache.hadoop.hbase.master.RegionManager:
assigning region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 to
server 10.131.237.51:60020
Then...
2008-09-08 10:46:41,493 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_PROCESS_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 from
10.131.237.51:60020
.................................
2008-09-08 10:47:23,753 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 from
10.131.237.51:60020
2008-09-08 10:47:23,753 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_OPEN:
users,8C4A27C9-B808-406A-A722-F1BE374F89EF,1220857848626 from
10.131.237.51:60020
2008-09-08 10:47:23,753 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Server 10.131.237.51:60020 is overloaded. Server load: 92 avg: 82.0
2008-09-08 10:47:23,753 DEBUG org.apache.hadoop.hbase.master.RegionManager:
Choosing to reassign 10 regions. mostLoadedRegions has 10 regions in it.
This is interesting (notice that everything above and beneath is in the same
second):
The region is open but
2008-09-08 10:47:23,753 INFO org.apache.hadoop.hbase.master.RegionManager:
Skipping region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
because it is already closing.
...
2008-09-08 10:48:12,140 INFO org.apache.hadoop.hbase.master.RegionManager:
Skipping region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
because it is already closing.
The region had been actually opened on the machine but
2008-09-08 10:48:13,374 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner REGION => {NAME =>
'users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108', STARTKEY =>
'15D050EC-CC9D-4683-B66E-17D58B2FD152', ENDKEY =>
'16D3F02B-AEAE-4273-A44D-2C0A
A6A0A62C', ENCODED => 1126800163, OFFLINE => true, SPLIT => true, TABLE =>
{{NAME => 'users', IS_ROOT => 'false', IS_META => 'false', FAMILIES =>
[{NAME => 'video', BLOOMFILTER => 'false', VERSIONS => '3', COMPRESSION =>
'NONE', LENGTH => '2147483647', TTL => '-1', IN_M
EMORY => 'false', BLOCKCACHE => 'false'}]}}}, SERVER =>
'10.131.237.51:60020', STARTCODE => 1220882303813
2008-09-08 11:03:19,849 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
users,1652503D-DF0F-464A-B6AD-C6DA4707AE77,1220893282341 no longer has
references to users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
And then the second split
2008-09-08 11:07:07,906 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_SPLIT:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [EMAIL PROTECTED] from
10.131.237.51:60020
And then after a cluster shutdown
2008-09-08 11:24:00,379 DEBUG
org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanRootRegion: process
server shutdown scanning root region on 10.131.236.91
The the lease expires
2008-09-08 11:22:49,504 INFO org.apache.hadoop.hbase.master.ServerManager:
10.131.237.51:60020 lease expired
It assigns it again..
2008-09-08 11:24:04,994 INFO org.apache.hadoop.hbase.master.RegionManager:
assigning region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 to
server 10.131.237.53:60020
2008-09-08 11:26:45,307 INFO org.apache.hadoop.hbase.master.ServerManager:
Received MSG_REPORT_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 from
10.131.237.53:60020
2008-09-08 11:40:14,452 DEBUG org.apache.hadoop.hbase.master.BaseScanner:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220897189382 no longer has
references to users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:40:14,455 INFO org.apache.hadoop.hbase.master.BaseScanner:
Deleting region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
because daughter splits no longer hold references
Log grep on 10.131.237.54 aka R1
2008-09-08 09:45:57,306 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:46:00,847 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:46:00,848 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Opening region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
2008-09-08 09:46:01,394 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Next sequence id for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 is 201561483
2008-09-08 09:46:01,449 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
available
2008-09-08 09:46:01,449 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
requested for region:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:50:41,743 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [EMAIL PROTECTED]
2008-09-08 09:56:34,587 INFO org.apache.hadoop.hbase.regionserver.HRegion:
starting compaction on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:21,868 INFO org.apache.hadoop.hbase.regionserver.HRegion:
compaction completed on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 in 4mins, 47sec
2008-09-08 10:01:21,938 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Starting split of region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,342 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Compactions and cache flushes disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,343 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Scanners disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,415 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more active scanners for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,416 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Updates disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,416 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more row locks outstanding on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:22,416 INFO org.apache.hadoop.hbase.regionserver.HRegion:
closed users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:01:31,322 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split, META
updated, and report to master all successful. Old region=REGION => {NAME =>
'users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108', STARTKEY =>
'15D050EC-CC9D-4683-B66E-17D58B2FD152', ENDKEY =>
'16D3F02B-AEAE-4273-A44D-2C0AA6A0A62C', ENCODED => 1126800163, OFFLINE =>
true, SPLIT => true, TABLE => {{NAME => 'users', IS_ROOT => 'false', IS_META
=> 'false', FAMILIES => [{NAME => 'video', BLOOMFILTER => 'false',
COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL => '-1',
IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}, new regions:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220893282341,
users,1652503D-DF0F-464A-B6AD-C6DA4707AE77,1220893282341. Split took 9sec
2008-09-08 10:11:58,606 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [EMAIL PROTECTED]
The grep of the region log on 10.131.237.51 aka R2
2008-09-08 09:22:37,183 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:22:37,196 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:22:37,196 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Opening region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
2008-09-08 09:22:37,468 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Next sequence id for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 is 201561483
2008-09-08 09:22:37,469 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
available
2008-09-08 09:22:37,469 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
requested for region:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,447 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [EMAIL PROTECTED]
2008-09-08 09:39:11,448 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_CLOSE:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108: [EMAIL PROTECTED]
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Compactions and cache flushes disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Scanners disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more active scanners for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Updates disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more row locks outstanding on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 09:39:11,448 INFO org.apache.hadoop.hbase.regionserver.HRegion:
closed users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:44:12,689 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:47:21,796 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 10:47:21,797 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Opening region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
2008-09-08 10:47:22,308 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Next sequence id for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 is 201561483
2008-09-08 10:47:22,309 INFO org.apache.hadoop.hbase.regionserver.HRegion:
region users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108/1126800163
available
2008-09-08 10:47:22,309 DEBUG
org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
requested for region:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:23,207 INFO org.apache.hadoop.hbase.regionserver.HRegion:
starting compaction on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:24,305 INFO org.apache.hadoop.hbase.regionserver.HRegion:
compaction completed on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108 in 1sec
2008-09-08 11:06:24,385 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Starting split of region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,601 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Compactions and cache flushes disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Scanners disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more active scanners for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Updates disabled for region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
No more row locks outstanding on region
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:06:29,602 INFO org.apache.hadoop.hbase.regionserver.HRegion:
closed users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108
2008-09-08 11:07:06,367 INFO
org.apache.hadoop.hbase.regionserver.CompactSplitThread: region split, META
updated, and report to master all successful. Old region=REGION => {NAME =>
'users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220890556108', STARTKEY =>
'15D050EC-CC9D-4683-B66E-17D58B2FD152', ENDKEY =>
'16D3F02B-AEAE-4273-A44D-2C0AA6A0A62C', ENCODED => 1126800163, OFFLINE =>
true, SPLIT => true, TABLE => {{NAME => 'users', IS_ROOT => 'false', IS_META
=> 'false', FAMILIES => [{NAME => 'video', BLOOMFILTER => 'false',
COMPRESSION => 'NONE', VERSIONS => '3', LENGTH => '2147483647', TTL => '-1',
IN_MEMORY => 'false', BLOCKCACHE => 'false'}]}}, new regions:
users,15D050EC-CC9D-4683-B66E-17D58B2FD152,1220897189382,
users,1652503D-DF0F-464A-B6AD-C6DA4707AE77,1220897189382. Split took 41sec
I understand now how unassignedRegions gets updated. However on a
leaseExpired situation the only method to modify it is unassignRootRegion
and this should only happen if the server with the expired lease holds the
root region.
Thanks,
Cosmin
>
> There are a couple of problems going on here if you see lease
> timeouts. One is that the region server is not sending its heartbeat
> soon enough (See HBASE-616 "We slept XXXXXX ms, ten times longer than
> scheduled: 3000" happens frequently."). This can happen on both the
> master and the region server meaning either the master does not
> process the heartbeat until after the lease times out, or the region
> server does not send the heartbeat until after the lease times out.
>
> This can be caused by thread starvation due to either too many
> runnable threads on a machine, not enough CPUs to handle the thread
> load, or just a bad thread scheduler.
>
> Hopefully, lease timeout will work better after HBase is integrated
> with Zookeeper.
>
> =====
>
> In the normal case, the master will not reassign a region due to load
> balancing until the region server reports that it has closed the
> region:
>
> Nothing happens with the mostLoadedRegions until it gets to
> RegionManager.unassignSomeRegions which is called by
> RegionManager.assignRegions
>
> In unassignSomeRegions, any regions selected are put into the
> closingRegions Set and a MSG_REGION_CLOSE will get sent to the region
> server. Candidates for assignment are only taken from the
> unassignedRegions Map.
>
> Not until the master receives a MSG_REPORT_CLOSE does any further
> action take place on that region.
>
> First the region is removed from the closingRegions Set.
>
> If the region was being split, the HRegionInfo received by the master
> will indicate that that the region is offline and split. In this case,
> it does not get reassigned.
>
> Otherwise it is added to the unassignedRegions Map and is now a
> candidate for reassignment.
>