[ https://issues.apache.org/jira/browse/HBASE-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean-Daniel Cryans updated HBASE-2707:
--------------------------------------
Attachment: HBASE-2707.patch
Stack and I talked about this at length; here's what we came up with. It's very
hard for me to write a unit test since the change is deep in the master and very
much time-based, but I ran TestReplication with the patch many times and 1) it
doesn't fail anymore and 2) the logs show the master doing the right thing.
Should I commit this?
> Can't recover from a dead ROOT server if any exceptions happens during log
> splitting
> ------------------------------------------------------------------------------------
>
> Key: HBASE-2707
> URL: https://issues.apache.org/jira/browse/HBASE-2707
> Project: HBase
> Issue Type: Bug
> Reporter: Jean-Daniel Cryans
> Assignee: Jean-Daniel Cryans
> Priority: Blocker
> Fix For: 0.21.0
>
> Attachments: HBASE-2707.patch
>
>
> It's fairly easy to get stuck after an RS holding ROOT dies, usually from a
> GC-like event. It happens frequently to my TestReplication in HBASE-2223.
> Some logs:
> {code}
> 2010-06-10 11:35:52,090 INFO [master] wal.HLog(1175): Spliting is done. Removing old log dir hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
> 2010-06-10 11:35:52,095 WARN [master] master.RegionServerOperationQueue(183): Failed processing: ProcessServerShutdown of 10.10.1.63,55846,1276194933831; putting onto delayed todo queue
> java.io.IOException: Cannot delete: hdfs://localhost:55814/user/jdcryans/.logs/10.10.1.63,55846,1276194933831
> at org.apache.hadoop.hbase.regionserver.wal.HLog.splitLog(HLog.java:1179)
> at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
> at org.apache.hadoop.hbase.master.RegionServerOperationQueue.process(RegionServerOperationQueue.java:149)
> at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:456)
> Caused by: java.io.IOException: java.io.IOException: /user/jdcryans/.logs/10.10.1.63,55846,1276194933831 is non empty
> 2010-06-10 11:35:52,097 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,098 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:53,523 INFO [main.serverMonitor] master.ServerManager$ServerMonitor(131): 1 region servers, 1 dead, average load 14.0[10.10.1.63,55846,1276194933831]
> 2010-06-10 11:35:54,099 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> 2010-06-10 11:35:55,101 DEBUG [master] master.RegionServerOperationQueue(126): -ROOT- isn't online, can't process delayedToDoQueue items
> {code}
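> The last lines are my own debug output. Since we don't process the delayed todo
> queue while ROOT isn't online, and the failed ProcessServerShutdown for the
> server that held ROOT is sitting on that queue, we'll never reassign the
> regions.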
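
To make the stall in the issue description easier to see, here is a minimal toy model in Java. It is only a sketch: `DelayedQueueSketch`, `Operation`, `tick`, and `simulate` are hypothetical names invented for illustration, not HBase's actual classes, and the ungated variant is there only for contrast. It is not a description of what the attached patch does.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the stall; none of these names are real HBase classes. */
public class DelayedQueueSketch {
    /** Hypothetical stand-in for a master operation such as a server-shutdown handler. */
    interface Operation {
        /** Returns true when done, false if it failed and must be retried later. */
        boolean process(DelayedQueueSketch master);
    }

    private final Deque<Operation> toDo = new ArrayDeque<>();
    private final Deque<Operation> delayed = new ArrayDeque<>();
    private final boolean gateOnRoot;
    boolean rootOnline = false; // the server holding ROOT has just died

    DelayedQueueSketch(boolean gateOnRoot) {
        this.gateOnRoot = gateOnRoot;
    }

    void submit(Operation op) {
        toDo.add(op);
    }

    /** One pass of the (assumed) master loop. */
    void tick() {
        Operation op = toDo.poll();
        if (op == null) {
            if (gateOnRoot && !rootOnline) {
                // Mirrors: "-ROOT- isn't online, can't process delayedToDoQueue items"
                return;
            }
            op = delayed.poll();
        }
        if (op != null && !op.process(this)) {
            delayed.add(op); // failed: put onto the delayed todo queue
        }
    }

    /** Runs the scenario and reports whether ROOT ever came back online. */
    public static boolean simulate(boolean gateOnRoot, int ticks) {
        DelayedQueueSketch master = new DelayedQueueSketch(gateOnRoot);
        int[] attempts = {0};
        master.submit(m -> {
            if (attempts[0]++ == 0) {
                return false; // first attempt fails, like the "Cannot delete" IOException
            }
            m.rootOnline = true; // a successful retry lets ROOT be reassigned
            return true;
        });
        for (int i = 0; i < ticks; i++) {
            master.tick();
        }
        return master.rootOnline;
    }

    public static void main(String[] args) {
        System.out.println("gated on ROOT: rootOnline=" + simulate(true, 10));
        System.out.println("no gate: rootOnline=" + simulate(false, 10));
    }
}
```

With the gate enabled, the only operation that could bring ROOT back is parked on the delayed queue, and that queue is never drained because ROOT is offline, so `simulate(true, 10)` returns `false` no matter how many ticks run. Without the gate, the retry on the second tick succeeds and `simulate(false, 10)` returns `true`.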