[ https://issues.apache.org/jira/browse/HBASE-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002551#comment-13002551 ]

dhruba borthakur commented on HBASE-3604:
-----------------------------------------

I did see this in the master logs (at the time when it decided to reassign the 
region from A to B):

{quote}
java.io.IOException: Discovered orphan hlog after split. Maybe HRegionServer was not dead when we started
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:290)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:151)
    at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:193)
    at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:96)
    at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
2011-03-03 05:09:55,551 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 64 region(s) that pumahbase024.snc5.facebook.com,60020,1299034310798 was carrying (skipping 0 regions(s) that are already in transition)

2011-03-03 05:09:55,588 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region XXXXs,aae1476e,1290555944175.c155464b43a3267c4a2778b769026775. to B.com,60020,1299034311241
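
The exception message hints at how the master notices the problem: if the dead server's log directory holds files after the split that were not there when the split began, something is still writing to it. A minimal Java sketch of that kind of check follows. It is not the actual HLogSplitter code; the class name, the verifyNoOrphans helper and the file names are made up for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class OrphanLogCheck {
    /**
     * Hypothetical helper: compares the log files seen before the split
     * with those present afterwards and fails if new ones have appeared.
     */
    static void verifyNoOrphans(Set<String> logsBeforeSplit, Set<String> logsAfterSplit)
            throws IOException {
        Set<String> orphans = new HashSet<>(logsAfterSplit);
        orphans.removeAll(logsBeforeSplit);
        if (!orphans.isEmpty()) {
            // A new log means the "dead" server kept writing during the split.
            throw new IOException("Discovered orphan hlog after split: " + orphans);
        }
    }

    public static void main(String[] args) {
        Set<String> before = new HashSet<>(Arrays.asList("hlog.1", "hlog.2"));
        // Assumed scenario: the old server rolled a new log while we were splitting.
        Set<String> after = new HashSet<>(Arrays.asList("hlog.1", "hlog.2", "hlog.3"));
        try {
            verifyNoOrphans(before, after);
            System.out.println("no orphan logs");
        } catch (IOException e) {
            System.out.println("split aborted: " + e.getMessage());
        }
    }
}
```

Note that in the log above the "Reassigning 64 region(s)" line comes right after the exception, so detecting the orphan log does not by itself appear to stop the reassignment.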

> Two region servers think that they own the same region: data loss
> -----------------------------------------------------------------
>
>                 Key: HBASE-3604
>                 URL: https://issues.apache.org/jira/browse/HBASE-3604
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.0
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>
> I observed this on a 100 node cluster that is constantly doing about 500K 
> ops/second.
> The region server on machine A was servicing IOs for a particular region. 
> Then the machine went into a bad state where it was ping-able but not 
> ssh-able. The master detected that there was a problem with machine A and 
> reassigned the region to machine B. The regionserver on machine B opened the 
> region and opened all the required HFiles for this region. After two hours, 
> the NameNode received a delete request for one of the HFiles from machine A 
> and happily renamed the file to HDFS-Trash. After another 3 hours or so, the 
> regionserver on machine B tried to read contents from that HFile but failed 
> because the file was renamed earlier. The region server on B is now stuck, 
> and data loss is possible. 
> The problem stems from the fact that although the master-and-ZK reassigned 
> the region, the old regionserver was possibly not dead.
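
The sequence in the description can be pictured with a toy model: the file system accepts a delete from whoever asks, so the old owner's late request still goes through. The Java sketch below is an illustration only. ToyFileSystem is not the HDFS API, and the epoch-based fencedDelete is a hypothetical guard added for contrast, not something this issue states HBase or HDFS does.

```java
import java.util.HashSet;
import java.util.Set;

class StaleOwnerDemo {
    /** Toy stand-in for the file system; this is not the HDFS API. */
    static class ToyFileSystem {
        private final Set<String> files = new HashSet<>();
        private long currentEpoch = 0;

        /** Called on each (re)assignment; returns the new owner's epoch. */
        long reassign() { return ++currentEpoch; }

        void create(String path) { files.add(path); }

        boolean exists(String path) { return files.contains(path); }

        /** Unfenced delete: any caller succeeds, as in the report. */
        boolean delete(String path) { return files.remove(path); }

        /** Hypothetical fenced delete: refuses callers holding an old epoch. */
        boolean fencedDelete(String path, long callerEpoch) {
            return callerEpoch == currentEpoch && files.remove(path);
        }
    }

    public static void main(String[] args) {
        ToyFileSystem fs = new ToyFileSystem();
        long epochA = fs.reassign();   // region assigned to server A
        fs.create("hfile-1");
        fs.reassign();                 // master reassigns to B; A is not really dead

        // Hypothetical fenced path: A's stale request is refused.
        System.out.println("fenced delete by A accepted: "
                + fs.fencedDelete("hfile-1", epochA));
        // Unfenced path, as described in the issue: A's delete goes through.
        System.out.println("unfenced delete by A accepted: " + fs.delete("hfile-1"));
        System.out.println("B can still read hfile-1: " + fs.exists("hfile-1"));
    }
}
```

The point of the contrast is only that some form of ownership check has to sit where the delete is honored; a master-side reassignment alone does not stop a server that is still running.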


        
