[ https://issues.apache.org/jira/browse/HBASE-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HBASE-921:
--------------------------------

    Issue Type: Sub-task  (was: Bug)
        Parent: HBASE-678

> region close and open processed out of order; makes for disagreement between 
> master and regionserver on region state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-921
>                 URL: https://issues.apache.org/jira/browse/HBASE-921
>             Project: Hadoop HBase
>          Issue Type: Sub-task
>    Affects Versions: 0.18.0
>            Reporter: stack
>            Assignee: Jim Kellerman
>            Priority: Blocker
>             Fix For: 0.18.1, 0.19.0
>
>         Attachments: 921-0.18.0.patch
>
>
> Master assigns region X successfully.  It then decides to close it because it
> wants it opened elsewhere as part of region rebalancing.  Both the open and
> close operations are reported back to the master.  Each has a processing
> component that is added to the todo list and processed in a thread outside
> the master's main loop.
> The close operation does the bulk of its work inline in the master's main
> processing loop.  Its todo component does some work if the region is
> offlined but otherwise nothing of consequence, whereas the open's todo
> component does the important update of the meta catalog table with the new
> location information.
> It's been fairly common here on our cluster for the master todo queue to be
> occupied processing the shutdown of a regionserver.  Processing a
> regionserver shutdown takes a long time when there are thousands of regions,
> and this delays the processing of the open and close todos.  Because the
> close has already done its real work inline while the open's meta update is
> still waiting in the queue, in effect the open runs after the close.  The
> region goes into limbo.  Only a restart of the 'hosting' regionserver
> 'fixes' this state.
> This is a particular case of the general HBASE-543 issue.  It's happening a
> lot here on our cluster, so I will hack up a fix for this, get it into TRUNK,
> and backport it to 0.18.1.
> Jim Firby here had a good idea for conditions like this.  Clients should be
> able to say: "I've asked for a region's location ten times now and, Mr.
> Master, you've given me the same response ten times in a row, and each time
> the answer was wrong.  Revisit any notion that said region is at said
> location."  Mr. Master would then go off and do something drastic, like
> close and reassign the region.
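
The client-side heuristic suggested above could look roughly like the following minimal Java sketch. Everything in it is hypothetical and not part of the HBase client API: the class name `StaleLocationTracker`, the methods `recordWrongAnswer` and `recordSuccess`, and the threshold of 10 (taken from the example in the description). The sketch only decides when a client should complain; how it would ask the master to revisit the region, and what the master then does, is left out.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: counts, per region, how many times in a row the
 * master has returned the same location and that location turned out to be
 * wrong. Once the count reaches the threshold, the caller should ask the
 * master to revisit its notion of where the region lives.
 */
class StaleLocationTracker {
  private final int threshold;
  // region name -> last wrong location reported by the master
  private final Map<String, String> lastWrongLocation = new HashMap<>();
  // region name -> consecutive wrong answers naming that same location
  private final Map<String, Integer> consecutiveWrong = new HashMap<>();

  StaleLocationTracker(int threshold) {
    this.threshold = threshold;
  }

  /**
   * Records that the master said {@code region} is at {@code location} and
   * the client found it was not. Returns true once the same wrong answer has
   * been seen at least {@code threshold} times in a row.
   */
  boolean recordWrongAnswer(String region, String location) {
    int count = location.equals(lastWrongLocation.get(region))
        ? consecutiveWrong.get(region) + 1
        : 1; // first or different wrong answer: restart the count
    lastWrongLocation.put(region, location);
    consecutiveWrong.put(region, count);
    return count >= threshold;
  }

  /** Clears state once the client reaches the region successfully. */
  void recordSuccess(String region) {
    lastWrongLocation.remove(region);
    consecutiveWrong.remove(region);
  }
}
```

Keying the count on the same wrong location, rather than on any failure, follows the description's condition of getting "the same response ten times in a row"; a changing answer restarts the count.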

-- 
This message is automatically generated by JIRA.