Two region servers think that they own the same region: data loss
-----------------------------------------------------------------

                 Key: HBASE-3604
                 URL: https://issues.apache.org/jira/browse/HBASE-3604
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.90.0
            Reporter: dhruba borthakur


I observed this on a 100-node cluster that is constantly doing about 500K 
ops/second.

The region server on machine A was servicing IOs for a particular region. Then 
the machine went into a bad state where it was ping-able but not ssh-able. The 
master detected that there was a problem with machine A and reassigned the 
region to machine B. The regionserver on machine B opened the region and all 
the required HFiles for it. Two hours later, the NameNode received a delete 
request for one of those HFiles from machine A and happily renamed the file 
into HDFS-Trash. About 3 hours after that, the regionserver on machine B tried 
to read from that HFile but failed because the file had already been renamed. 
The region server on B is now stuck, and data may have been lost.

The problem stems from the fact that although the master and ZK reassigned the 
region, the old regionserver was possibly not dead and could still issue 
requests against the region's files.
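
One general way to guard against a stale owner like this is fencing: every 
assignment of a region gets a monotonically increasing epoch, and destructive 
requests carrying an older epoch are rejected. The sketch below only 
illustrates that idea. Coordinator, FencedStore and StaleOwnerError are 
hypothetical names made up for this example; they are not HBase, ZooKeeper or 
HDFS APIs, and this is not necessarily how the issue should be fixed.

```python
class StaleOwnerError(Exception):
    """Raised when a request carries an epoch older than the current one."""


class Coordinator:
    """Hypothetical stand-in for the master and ZK: tracks region ownership."""

    def __init__(self):
        self._epochs = {}  # region -> epoch of the latest assignment
        self._owners = {}  # region -> server that currently owns the region

    def assign(self, region, server):
        # Each (re)assignment bumps the epoch, which invalidates older owners.
        epoch = self._epochs.get(region, 0) + 1
        self._epochs[region] = epoch
        self._owners[region] = server
        return epoch

    def current_epoch(self, region):
        return self._epochs[region]


class FencedStore:
    """Hypothetical stand-in for the file system that holds the HFiles."""

    def __init__(self, coordinator):
        self._coordinator = coordinator
        self._files = {}  # region -> set of file names

    def add_file(self, region, name):
        self._files.setdefault(region, set()).add(name)

    def exists(self, region, name):
        return name in self._files.get(region, set())

    def delete(self, region, name, epoch):
        # Assumption: the store can look up the latest epoch for the region.
        if epoch < self._coordinator.current_epoch(region):
            raise StaleOwnerError(
                "epoch %d is stale for region %s" % (epoch, region))
        self._files.get(region, set()).discard(name)


if __name__ == "__main__":
    coord = Coordinator()
    store = FencedStore(coord)
    store.add_file("region-1", "hfile-1")

    epoch_a = coord.assign("region-1", "machine-A")
    epoch_b = coord.assign("region-1", "machine-B")  # A looked dead

    try:
        store.delete("region-1", "hfile-1", epoch_a)  # late delete from A
    except StaleOwnerError as err:
        print("rejected:", err)

    print("file still present:", store.exists("region-1", "hfile-1"))
```

In this sketch the late delete from machine A is refused because its epoch 
predates the reassignment to machine B, so B can still read the file.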


-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
