region close and open processed out of order; makes for disagreement between 
master and regionserver on region state
--------------------------------------------------------------------------------------------------------------------

                 Key: HBASE-921
                 URL: https://issues.apache.org/jira/browse/HBASE-921
             Project: Hadoop HBase
          Issue Type: Bug
    Affects Versions: 0.18.0
            Reporter: stack
            Priority: Blocker
             Fix For: 0.18.1, 0.19.0


Master assigns region X successfully.  It then decides to close it because it 
wants it opened elsewhere as part of region rebalancing.  Both the open and 
close operations are reported back to the master.  Both have operation 
processing components that are added to the todo list to be processed in 
another thread outside of the master's main loop.

The close operation does the bulk of its work inline with the master's main 
processing loop.  Its todo component does some work if the region is offlined 
but otherwise nothing of consequence, whereas the open's todo component does 
the important meta catalog table update with the new location information.

It's been fairly common here on our cluster for the master todo queue to be 
occupied processing the shutdown of a regionserver.  It takes a long time to 
process the shutdown of a regionserver when thousands of regions are involved.  
This delays the processing of the open and close todos.  In effect the open is 
running after the close.  The region goes into limbo.  Only a restart of the 
'hosting' regionserver 'fixes' this state.
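
To make the ordering problem concrete, here is a rough single-threaded sketch 
of the sequence described above.  It is not HBase code: the class, the 
MasterView type, and the names "regionX" and "serverA" are all made up for 
illustration, and the real master runs the todo list in a separate thread.  
The only point is that the close's inline work lands before the open's 
deferred meta update.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class TodoOrderingSketch {
    /** Hypothetical stand-in for the master's view of a single region. */
    static final class MasterView {
        boolean assigned = true;                           // in-memory state, changed inline
        final Map<String, String> meta = new HashMap<>();  // stand-in for the meta catalog table
    }

    /** Replays the open/close sequence and returns the resulting view. */
    static MasterView replay() {
        MasterView view = new MasterView();
        Queue<Runnable> todo = new ArrayDeque<>();

        // The todo queue is already occupied by a slow regionserver shutdown.
        todo.add(() -> { /* long-running shutdown processing would go here */ });

        // Open is reported: the important meta update is deferred to the todo queue.
        todo.add(() -> view.meta.put("regionX", "serverA"));

        // Close is reported: the bulk of its work happens inline, right now.
        view.assigned = false;
        todo.add(() -> { /* nothing of consequence unless the region is offlined */ });

        // Only now does the todo processing get around to the queued items,
        // so the open's meta update effectively runs after the close.
        while (!todo.isEmpty()) {
            todo.poll().run();
        }
        return view;
    }

    public static void main(String[] args) {
        MasterView v = replay();
        // prints: assigned=false, meta={regionX=serverA}
        System.out.println("assigned=" + v.assigned + ", meta=" + v.meta);
    }
}
```

In this toy model the catalog ends up pointing at serverA even though the 
master has already closed the region there, which is the kind of disagreement 
the issue title refers to.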

This is a particular case of the general HBASE-543 issue.  It's happening a 
lot here on our cluster, so I will hack up a fix for this and get it into 
TRUNK and backport it to 0.18.1.

Jim Firby here had a good idea for conditions like this.  Clients should be 
able to say "I've asked for a region's location 10 times now and Mr. Master, 
you've given me the same response ten times in a row and each time, the answer 
was wrong.  Revisit any notion that said region is at said location".  Mr. 
Master would then go off and do something drastic like close and reassign the 
region.
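
A minimal sketch of what the client-side bookkeeping for this idea might look 
like follows.  Everything in it is hypothetical: StaleLocationTracker, 
recordWrongAnswer, and the threshold parameter are invented names, and no such 
client API or master RPC exists in HBase as far as this issue says.  It only 
counts how many times in a row the same location was handed out and turned out 
to be wrong.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleLocationTracker {
    private final int threshold;
    private final Map<String, String> lastLocation = new HashMap<>();
    private final Map<String, Integer> failures = new HashMap<>();

    public StaleLocationTracker(int threshold) {
        this.threshold = threshold;
    }

    /**
     * Records that the location the master gave for a region was wrong.
     * Returns true once the same wrong answer has been seen `threshold`
     * times in a row, i.e. when the client should complain to the master.
     */
    public boolean recordWrongAnswer(String region, String location) {
        String previous = lastLocation.put(region, location);
        if (!location.equals(previous)) {
            failures.put(region, 0);  // a different answer: start counting again
        }
        int count = failures.merge(region, 1, Integer::sum);
        return count >= threshold;
    }

    /** Clears the history for a region once a lookup finally works. */
    public void recordSuccess(String region) {
        lastLocation.remove(region);
        failures.remove(region);
    }

    public static void main(String[] args) {
        StaleLocationTracker tracker = new StaleLocationTracker(10);
        boolean complain = false;
        for (int i = 0; i < 10; i++) {
            complain = tracker.recordWrongAnswer("regionX", "serverA");
        }
        System.out.println("ask master to revisit regionX: " + complain);
    }
}
```

What the master does with such a complaint (for example, close and reassign 
the region) is left out here, since that part is not specified above.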



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
