SplitRegionHandler doesn't care if it deletes the znode or not, leaves the 
parent region stuck offline
------------------------------------------------------------------------------------------------------

                 Key: HBASE-4792
                 URL: https://issues.apache.org/jira/browse/HBASE-4792
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.92.0
            Reporter: Jean-Daniel Cryans
            Priority: Critical
             Fix For: 0.92.0, 0.94.0


Saw this on a little test cluster; it's really easy to trigger.

First the master log:

{quote}
2011-11-15 22:28:57,900 DEBUG org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handling SPLIT event for e5be6551c8584a6a1065466e520faf4e; deleting node
2011-11-15 22:28:57,900 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x132f043bbde08c1 Deleting existing unassigned node for e5be6551c8584a6a1065466e520faf4e that is in expected state RS_ZK_REGION_SPLIT
2011-11-15 22:28:57,975 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: master:62003-0x132f043bbde08c1 Attempting to delete unassigned node in RS_ZK_REGION_SPLIT state but after verifying state, we got a version mismatch
2011-11-15 22:28:57,975 INFO org.apache.hadoop.hbase.master.handler.SplitRegionHandler: Handled SPLIT report); parent=TestTable,0001355346,1321396080924.e5be6551c8584a6a1065466e520faf4e. daughter a=TestTable,0001355346,1321396132414.df9b549eb594a1f8728608a2a431224a. daughter b=TestTable,0001368082,1321396132414.de861596db4337dc341138f26b9c8bc2.
...
2011-11-15 22:28:58,052 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_SPLIT, server=sv4r28s44,62023,1321395865619, region=e5be6551c8584a6a1065466e520faf4e
2011-11-15 22:28:58,052 WARN org.apache.hadoop.hbase.master.AssignmentManager: Region e5be6551c8584a6a1065466e520faf4e not found on server sv4r28s44,62023,1321395865619; failed processing
2011-11-15 22:28:58,052 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received SPLIT for region e5be6551c8584a6a1065466e520faf4e from server sv4r28s44,62023,1321395865619 but it doesn't exist anymore, probably already processed its split
(repeated forever)
{quote}

The master processes the split, but when it calls ZKAssign.deleteNode it doesn't check the boolean the method returns; in this case it was false. So as far as the master is concerned the split is complete, but for the region server it's another story:

{quote}
2011-11-15 22:28:57,661 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Attempting to transition node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLIT
2011-11-15 22:28:57,775 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Successfully transitioned node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLIT
2011-11-15 22:28:57,775 INFO org.apache.hadoop.hbase.regionserver.SplitTransaction: Still waiting on the master to process the split for e5be6551c8584a6a1065466e520faf4e
2011-11-15 22:28:57,876 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Attempting to transition node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
2011-11-15 22:28:57,967 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Successfully transitioned node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
2011-11-15 22:28:58,067 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Attempting to transition node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
2011-11-15 22:28:58,108 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:62023-0x132f043bbde08d3 Successfully transitioned node e5be6551c8584a6a1065466e520faf4e from RS_ZK_REGION_SPLIT to RS_ZK_REGION_SPLIT
(printed forever)
{quote}

Since the znode isn't really deleted, the region server thinks the master just hasn't gotten around to processing its split yet, so it keeps waiting, which leaves the region *unavailable*.
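
For context, the RS-side wait boils down to something like this (a hand-written sketch, not the actual SplitTransaction code; waitOnMaster is a made-up helper name):

{code}
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.zookeeper.ZKAssign;
import org.apache.hadoop.hbase.zookeeper.ZKUtil;
import org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher;

public class SplitWaitSketch {
  // Hypothetical helper, not the real SplitTransaction code: the RS spins
  // until the master deletes the parent's unassigned znode, sleeping 100ms
  // between checks (hence the ~100ms gaps between the RS log lines above).
  static void waitOnMaster(ZooKeeperWatcher zkw, HRegionInfo parent)
      throws Exception {
    String node = ZKAssign.getNodeName(zkw, parent.getEncodedName());
    // ZKUtil.checkExists returns the znode version, or -1 once it's gone
    while (ZKUtil.checkExists(zkw, node) != -1) {
      Thread.sleep(100);
    }
  }
}
{code}

If the master's delete silently fails, checkExists never returns -1 and the loop never exits, which is exactly the spinning in the log above.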

We need to retry the delete master-side ASAP since the RS only waits 100ms between retries; see the sketch below.
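
Roughly this (a sketch of the idea, not a patch against SplitRegionHandler; deleteSplitNode is a made-up helper name):

{code}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hbase.executor.EventHandler.EventType;
import org.apache.hadoop.hbase.zookeeper.ZKAssign;
import org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher;
import org.apache.zookeeper.KeeperException;

public class SplitDeleteSketch {
  private static final Log LOG = LogFactory.getLog(SplitDeleteSketch.class);

  // Hypothetical helper: loop on deleteNode instead of ignoring its boolean.
  // deleteNode returns false on a version mismatch, i.e. the RS re-touched
  // the znode between the master's read and delete; since the RS sleeps
  // 100ms between touches, an immediate retry should win the race.
  static void deleteSplitNode(ZooKeeperWatcher zkw, String encodedName)
      throws KeeperException {
    while (!ZKAssign.deleteNode(zkw, encodedName,
        EventType.RS_ZK_REGION_SPLIT)) {
      LOG.debug("Version mismatch deleting split node for " + encodedName
        + ", retrying");
    }
  }
}
{code}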

At the same time, it would be nice if ZKAssign.deleteNode always printed the name of the region in its messages; it took me a while to see from a grep that the delete didn't take effect.
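
Something along these lines in deleteNode's warning (illustrative fragment only; regionName and expectedState stand in for the method's existing parameters):

{code}
// Name the region so a grep for it turns up the failed delete:
LOG.warn(zkw.prefix("Attempting to delete unassigned node for " + regionName
  + " in " + expectedState + " state but after verifying state, we got a "
  + "version mismatch"));
{code}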
