[
https://issues.apache.org/jira/browse/HBASE-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665216#action_12665216
]
Jim Kellerman commented on HBASE-1123:
--------------------------------------
There are a number of problems here:
- Leases should not only be identified by server name and port number
but also by the server start code. Since HLog directory names
include the start code, if a server should crash and restart before
the old lease expires, there is no danger of the new incarnation of
the server overwriting the old instance's HLog. It can then be put
back to work immediately.
- If a server's lease does time out, (because it hasn't reported in)
and the region server reports in, we should not wait in a region
server report thread in the master because cleaning up after the
server could take longer than IPC timeout.
- The order in which events happen when a region server receives a
MSG_CALL_SERVER_STARTUP is incorrect. The region server should call
reportForDuty before creating a new HLog.
> Server never leaves the dead list though logs have all been processed if
> crashed server had -ROOT- (seemingly)
> --------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-1123
> URL: https://issues.apache.org/jira/browse/HBASE-1123
> Project: Hadoop HBase
> Issue Type: Bug
> Affects Versions: 0.19.0
> Reporter: stack
> Assignee: Jim Kellerman
> Fix For: 0.20.0
>
> Attachments: 1123.patch
>
>
> Cluster is just hung after host that had -ROOT- completed splitting its
> logs... old server is just stuck on the dead list and never comes off it.
> {code}
> ..
> 2009-01-13 01:09:36,448 [HMaster] DEBUG
> org.apache.hadoop.hbase.regionserver.HLog: Splitting 6 of 6:
> hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/log_XX.XX.XX.142_1231717984112_60020/hlog.dat.1231718928939
> 2009-01-13 01:09:37,396 [IPC Server handler 4 on 60000] DEBUG
> org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX142:60020
> removal from dead list before processing report-for-duty request
> 2009-01-13 01:09:38,591 [HMaster] DEBUG
> org.apache.hadoop.hbase.regionserver.HLog: Creating new log file writer for
> path
> hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/TestTable/712889985/oldlogfile.log
> and region TestTable,0040922294,1231559109829
> 2009-01-13 01:09:38,670 [HMaster] DEBUG
> org.apache.hadoop.hbase.regionserver.HLog: Creating new log file writer for
> path
> hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/TestTable/484208094/oldlogfile.log
> and region TestTable,0042007133,1231628296909
> 2009-01-13 01:09:45,096 [HMaster] INFO
> org.apache.hadoop.hbase.regionserver.HLog: log file splitting completed for
> hdfs://aa0-000-12.u.powerset.com:9000/hbasetrunk2/log_XX.XX.XX.142_1231717984112_60020
> 2009-01-13 01:09:47,317 [SocketListener0-2] DEBUG
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for
> row <> in tableName .META.: location serverXX.XX.XX.142:60020, location
> region name .META.,,1
> 2009-01-13 01:09:47,416 [IPC Server handler 4 on 60000] DEBUG
> org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX142:60020
> removal from dead list before processing report-for-duty request
> 2009-01-13 01:09:47,518 [IPC Server handler 3 on 60000] INFO
> org.apache.hadoop.hbase.master.RegionManager: assigning region -ROOT-,,0 to
> server XX.XX.XX141:60020
> 2009-01-13 01:09:49,007 [IPC Server handler 6 on 60000] DEBUG
> org.apache.hadoop.hbase.master.ServerManager: Total Load: 430, Num Servers:
> 3, Avg Load: 144.0
> 2009-01-13 01:09:50,219 [SocketListener0-0] DEBUG
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for
> row <> in tableName .META.: location server XX.XX.XX.142:60020, location
> region name .META.,,1
> 2009-01-13 01:09:50,539 [IPC Server handler 2 on 60000] INFO
> org.apache.hadoop.hbase.master.ServerManager: Received
> MSG_REPORT_PROCESS_OPEN: -ROOT-,,0 from XX.XX.XX.141:60020
> 2009-01-13 01:09:50,539 [IPC Server handler 2 on 60000] INFO
> org.apache.hadoop.hbase.master.ServerManager: Received MSG_REPORT_OPEN:
> -ROOT-,,0 from 208.76.44.141:60020
> 2009-01-13 01:09:50,719 [SocketListener0-3] DEBUG
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for
> row <> in tableName .META.: location server XX.XX.XX.142:60020, location
> region name .META.,,1
> 2009-01-13 01:09:50,967 [SocketListener0-4] DEBUG
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for
> row <> in tableName .META.: location serverXX.XX.XX.142:60020, location
> region name .META.,,1
> 2009-01-13 01:09:52,117 [SocketListener0-5] DEBUG
> org.apache.hadoop.hbase.client.HConnectionManager$TableServers: Cache hit for
> row <> in tableName .META.: location server XX.XX.XX.142:60020, location
> region name .META.,,1
> ....
> 2009-01-13 01:09:57,426 [IPC Server handler 4 on 60000] DEBUG
> org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX.142:60020
> removal from dead list before processing report-for-duty request
> ....
> 2009-01-13 01:10:45,156 [HMaster] DEBUG
> org.apache.hadoop.hbase.master.HMaster: Processing todo:
> ProcessServerShutdown of XX.XX.XX142:60020
> 2009-01-13 01:10:45,156 [HMaster] INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of
> server XX.XX.XX.142:60020: logSplit: true, rootRescanned: false,
> numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2009-01-13 01:10:45,156 [HMaster] DEBUG
> org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanRootRegion: process
> server shutdown scanning root region on XX.XX.XX.141
> 2009-01-13 01:10:45,182 [HMaster] DEBUG
> org.apache.hadoop.hbase.master.RegionServerOperation: process server shutdown
> scanning root region on XX.XX.XX.141 finished HMaster
> 2009-01-13 01:10:45,183 [HMaster] DEBUG
> org.apache.hadoop.hbase.master.ProcessServerShutdown$ScanMetaRegions: process
> server shutdown scanning .META.,,1 on XX.XX.XX.142:60020
> 2009-01-13 01:10:47,496 [IPC Server handler 4 on 60000] DEBUG
> org.apache.hadoop.hbase.master.ServerManager: Waiting on XX.XX.XX.142:60020
> removal from dead list before processing report-for-duty request
> 2009-01-13 01:10:49,320 [IPC Server handler 8 on 60000] DEBUG
> org.apache.hadoop.hbase.master.ServerManager: Total Load: 431, Num Servers:
> 3, Avg Load: 144.0
> .....
> {code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.