[ 
https://issues.apache.org/jira/browse/HBASE-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665221#action_12665221
 ] 

Jim Kellerman commented on HBASE-1123:
--------------------------------------

> stack - 19/Jan/09 12:13 PM
> The above seem like good stuff but how do any of the above impinge
> on this issue? 

Here's what I think happened:

- region server crashed
- lease timed out
- master starts recovery (can take quite a while to complete)
- region server restarts
- region server sends region server startup message to master
- master waits in rpc handler for old server cleanup (because it
  cannot differentiate the new instance from the old). 
- ipc from region server to master times out
- region server sends a new startup message. Another master rpc
  handler thread starts waiting for old server cleanup.
- ipc from region server to master times out

...

This could easily result in all the master's region server rpc
handlers waiting for essentially the same event until
ProcessServerShutdown completes and removes the original dead server
from the dead servers list. Even if not all of the master's region
server handler threads are tied up, this would severely impair the
master's ability to respond to other region server requests.
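
To make the scenario concrete, here is a minimal sketch in plain Java.
It is not HBase code: the class name, the simulate() method, and the
latch standing in for "ProcessServerShutdown removed the server from
the dead list" are all made up for illustration, and the handler
counts are assumed values. Each retried startup message parks one
handler from a fixed pool, and the client-side ipc timeout does not
free that handler.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the master's rpc handler pool; not HBase code.
final class HandlerStarvationSketch {

    /** Outcome of one simulated run. */
    static final class Result {
        final int blockedHandlers;        // handlers parked waiting on dead-server cleanup
        final boolean otherRequestServed; // did an unrelated request get a handler in time?

        Result(int blockedHandlers, boolean otherRequestServed) {
            this.blockedHandlers = blockedHandlers;
            this.otherRequestServed = otherRequestServed;
        }
    }

    static Result simulate(int handlerCount, int startupAttempts) throws Exception {
        ExecutorService handlers = Executors.newFixedThreadPool(handlerCount);
        // Released only when the (simulated) shutdown processing finishes.
        CountDownLatch deadServerRemoved = new CountDownLatch(1);
        int expectedBlocked = Math.min(handlerCount, startupAttempts);
        CountDownLatch parked = new CountDownLatch(expectedBlocked);
        AtomicInteger blocked = new AtomicInteger();
        try {
            // Each startup message (original plus retries after ipc timeouts)
            // occupies a handler that waits for the same event.
            for (int i = 0; i < startupAttempts; i++) {
                handlers.submit(() -> {
                    blocked.incrementAndGet();
                    parked.countDown();
                    deadServerRemoved.await();
                    return null;
                });
            }
            parked.await(); // every handler that can be blocked is now blocked
            int blockedNow = blocked.get();

            // An unrelated request, e.g. a report from a healthy region server.
            Future<String> other = handlers.submit(() -> "ok");
            boolean served;
            try {
                other.get(200, TimeUnit.MILLISECONDS);
                served = true;
            } catch (TimeoutException e) {
                served = false; // no free handler picked it up in time
            }
            return new Result(blockedNow, served);
        } finally {
            deadServerRemoved.countDown(); // let the parked handlers finish
            handlers.shutdown();
            handlers.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        Result starved = simulate(3, 5);
        System.out.println("3 handlers, 5 attempts: blocked=" + starved.blockedHandlers
                + " otherServed=" + starved.otherRequestServed);
        Result spare = simulate(10, 3);
        System.out.println("10 handlers, 3 attempts: blocked=" + spare.blockedHandlers
                + " otherServed=" + spare.otherRequestServed);
    }
}
```

With more startup attempts than handlers, every handler in the sketch
is parked and the unrelated request is not served until the latch is
released. With spare handlers the unrelated request still goes
through, but the parked ones remain unavailable for the whole
recovery.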



> Server never leaves the dead list though logs have all been processed if 
> crashed server had -ROOT- (seemingly)
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-1123
>                 URL: https://issues.apache.org/jira/browse/HBASE-1123
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: stack
>            Assignee: Jim Kellerman
>             Fix For: 0.20.0
>
>         Attachments: 1123.patch
>
>
> Cluster is just hung after host that had -ROOT- completed splitting its 
> logs... old server is just stuck on the dead list and never comes off it.
> {code}
> ..
> 2009-01-13 01:09:36,448 [HMaster] DEBUG 
> org.apache.hadoop.hbase.regionserver.HLog: Splitting 6 of 6: 
> hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/log_XX.XX.XX.142_1231717984112_60020/hlog.dat.1231718928939
> 2009-01-13 01:09:37,396 [IPC Server handler 4 on 60000] DEBUG 
> org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX.142:60020 
> removal from dead list before processing report-for-duty request
> 2009-01-13 01:09:38,591 [HMaster] DEBUG 
> org.apache.hadoop.hbase.regionserver.HLog: Creating new log file writer for 
> path 
> hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/TestTable/712889985/oldlogfile.log
>  and region TestTable,0040922294,1231559109829
> 2009-01-13 01:09:38,670 [HMaster] DEBUG 
> org.apache.hadoop.hbase.regionserver.HLog: Creating new log file writer for 
> path 
> hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/TestTable/484208094/oldlogfile.log
>  and region TestTable,0042007133,1231628296909
> 2009-01-13 01:09:45,096 [HMaster] INFO 
> org.apache.hadoop.hbase.regionserver.HLog: log file splitting completed for 
> hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/log_XX.XX.XX.142_1231717984112_60020
> 2009-01-13 01:09:47,317 [SocketListener0-2] DEBUG 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for 
> row <> in tableName .META.: location server XX.XX.XX.142:60020, location 
> region name .META.,,1
> 2009-01-13 01:09:47,416 [IPC Server handler 4 on 60000] DEBUG 
> org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX.142:60020 
> removal from dead list before processing report-for-duty request
> 2009-01-13 01:09:47,518 [IPC Server handler 3 on 60000] INFO 
> org.apache.hadoop.hbase.master.RegionManager: assigning region -ROOT-,,0 to 
> server XX.XX.XX.141:60020
> 2009-01-13 01:09:49,007 [IPC Server handler 6 on 60000] DEBUG 
> org.apache.hadoop.hbase.master.ServerManager: Total Load: 430, Num Servers: 
> 3, Avg Load: 144.0
> 2009-01-13 01:09:50,219 [SocketListener0-0] DEBUG 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for 
> row <> in tableName .META.: location server XX.XX.XX.142:60020, location 
> region name .META.,,1
> 2009-01-13 01:09:50,539 [IPC Server handler 2 on 60000] INFO 
> org.apache.hadoop.hbase.master.ServerManager: Received 
> MSG_REPORT_PROCESS_OPEN: -ROOT-,,0 from XX.XX.XX.141:60020
> 2009-01-13 01:09:50,539 [IPC Server handler 2 on 60000] INFO 
> org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_OPEN: 
> -ROOT-,,0 from 208.76.44.141:60020
> 2009-01-13 01:09:50,719 [SocketListener0-3] DEBUG 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for 
> row <> in tableName .META.: location server XX.XX.XX.142:60020, location 
> region name .META.,,1
> 2009-01-13 01:09:50,967 [SocketListener0-4] DEBUG 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for 
> row <> in tableName .META.: location server XX.XX.XX.142:60020, location 
> region name .META.,,1
> 2009-01-13 01:09:52,117 [SocketListener0-5] DEBUG 
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for 
> row <> in tableName .META.: location server XX.XX.XX.142:60020, location 
> region name .META.,,1
> ....
> 2009-01-13 01:09:57,426 [IPC Server handler 4 on 60000] DEBUG 
> org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX.142:60020 
> removal from dead list before processing report-for-duty request
> ....
> 2009-01-13 01:10:45,156 [HMaster] DEBUG 
> org.apache.hadoop.hbase.master.HMaster: Processing todo: 
> ProcessServerShutdown of XX.XX.XX.142:60020
> 2009-01-13 01:10:45,156 [HMaster] INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of 
> server XX.XX.XX.142:60020: logSplit: true, rootRescanned: false, 
> numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2009-01-13 01:10:45,156 [HMaster] DEBUG 
> org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanRootRegion: process 
> server shutdown scanning root region on XX.XX.XX.141
> 2009-01-13 01:10:45,182 [HMaster] DEBUG 
> org.apache.hadoop.hbase.master.RegionServerOperation: process server shutdown 
> scanning root region on XX.XX.XX.141 finished HMaster
> 2009-01-13 01:10:45,183 [HMaster] DEBUG 
> org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanMetaRegions: process 
> server shutdown scanning .META.,,1 on XX.XX.XX.142:60020
> 2009-01-13 01:10:47,496 [IPC Server handler 4 on 60000] DEBUG 
> org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX.142:60020 
> removal from dead list before processing report-for-duty request
> 2009-01-13 01:10:49,320 [IPC Server handler 8 on 60000] DEBUG 
> org.apache.hadoop.hbase.master.ServerManager: Total Load: 431, Num Servers: 
> 3, Avg Load: 144.0
> .....
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
