[jira] Updated: (HBASE-1457) Taking down ROOT/META regionserver can result in cluster becoming in-operational

ryan rawson (JIRA) Sun, 31 May 2009 01:42:35 -0700

     [ https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


ryan rawson updated HBASE-1457:
-------------------------------

    Attachment: HBASE-1457-v4.patch

The latest fix, which includes:
- make region historian writes go through the todo queue
- make the todo queue a priority queue, so higher-priority items are processed 
first
- ensure double assignment of ROOT/META can't happen
- prevent assignment bugs when the cluster is mis-loaded, and ensure ROOT/META 
get assigned as fast as possible to the first server (rather than to the 'best' 
server, as was done previously)
-- assignment could get stuck when the 'best' server was unable to contact the 
master because ROOT/META was offline. Very ugly bug.
- reduce how much we retry in pending operations. Excessive retries can delay 
recovery: if META/ROOT goes down while a TODO is being processed, the recovery 
of META/ROOT has to wait until the currently running pending operation times 
out. This could take over 5 minutes previously (!!).  1-second timeouts * 10 * 
2-3 per commit() * 2 attempts takes a long time.
- improve the handling of a bug where, if ROOT was unavailable, some pending 
operations might fail and not get requeued.
- Handle bugs where a server would go offline and 'forget' to mention that ROOT 
or META went offline, thus delaying reassignment.  Now we force META/ROOT 
offline ASAP and get them reassigned as fast as possible on clean shutdown.
- Improved unclean shutdown handling of META: instead of waiting for the ROOT 
scanner to detect a bad assignment and fix it, be more proactive and queue 
META for assignment as soon as the log split is finished.  This can improve 
META recovery time by 5-10 seconds.
- Fixed a rare but deadly NPE in ProcessRegionOpen and improved the handling of 
failed todo operations: instead of putting them back into the todo queue, put 
them into the delayed queue (since the NPE is a side effect of not having ROOT 
online yet).
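
As a rough illustration of the queueing changes listed above (a priority todo 
queue, plus a delayed queue for operations that fail while ROOT is offline), 
here is a minimal sketch. It is not the code from the patch: the class 
TodoQueueSketch, the Priority values, and the rootOnline flag are all 
hypothetical names made up for this example, and the real master logic is 
considerably more involved.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: names and priorities are assumptions, not HBase classes.
class TodoQueueSketch {
    // Hypothetical priorities; earlier constants are served first.
    enum Priority { ROOT_META, REGION_OPEN, HISTORIAN_WRITE }

    static final class Op implements Comparable<Op> {
        private static final AtomicLong SEQ = new AtomicLong();
        final String name;
        final Priority priority;
        final long seq = SEQ.getAndIncrement(); // FIFO tie-breaker

        Op(String name, Priority priority) {
            this.name = name;
            this.priority = priority;
        }

        @Override
        public int compareTo(Op other) {
            int c = priority.compareTo(other.priority);
            return c != 0 ? c : Long.compare(seq, other.seq);
        }
    }

    // A failed operation, parked until its retry time arrives.
    static final class DelayedOp implements Delayed {
        final Op op;
        final long readyAtNanos;

        DelayedOp(Op op, long delayMillis) {
            this.op = op;
            this.readyAtNanos =
                System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(readyAtNanos - System.nanoTime(),
                                TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                                other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    final PriorityBlockingQueue<Op> todo = new PriorityBlockingQueue<>();
    final DelayQueue<DelayedOp> delayed = new DelayQueue<>();

    void submit(String name, Priority priority) {
        todo.add(new Op(name, priority));
    }

    // Takes the highest-priority operation. If it cannot run because ROOT is
    // offline, it goes to the delayed queue instead of straight back onto the
    // todo queue. Returns the name of the operation that ran, or null if
    // nothing ran (queue empty, or the operation was deferred).
    String processOne(boolean rootOnline, long retryDelayMillis) {
        Op op = todo.poll();
        if (op == null) {
            return null;
        }
        if (!rootOnline && op.priority != Priority.ROOT_META) {
            delayed.add(new DelayedOp(op, retryDelayMillis));
            return null;
        }
        return op.name;
    }

    // Moves operations whose delay has expired back onto the todo queue and
    // returns how many were moved.
    int promoteExpired() {
        int moved = 0;
        for (DelayedOp d; (d = delayed.poll()) != null; ) {
            todo.add(d.op);
            moved++;
        }
        return moved;
    }
}
```

In this sketch, an operation that needs ROOT does not spin at the head of the 
todo queue while ROOT is offline; it waits in the delayed queue, so ROOT/META 
work submitted later is still picked up first.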

Yes, all of these bug fixes are incorporated in this relatively small patch 
(933 lines of diff).


> Taking down ROOT/META regionserver can result in cluster becoming 
> in-operational
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-1457
>                 URL: https://issues.apache.org/jira/browse/HBASE-1457
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>            Reporter: ryan rawson
>            Assignee: ryan rawson
>             Fix For: 0.20.0
>
>         Attachments: HBASE-1457-v2.patch, HBASE-1457-v3.patch, 
> HBASE-1457-v4.patch, HBASE-1457.patch
>
>
> When a regionserver is taken down via controlled or uncontrolled shutdown, 
> the master doesn't properly reassign the root/meta regions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
