[
https://issues.apache.org/jira/browse/HBASE-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ryan rawson updated HBASE-1457:
-------------------------------
Attachment: HBASE-1457-v4.patch
The latest fix, including:
- route region historian writes through the todo queue
- make the todo queue a priority queue, so higher-priority items go to the top
- ensure double assignment of ROOT/META can't happen
- prevent assignment bugs when the cluster is mis-loaded, and ensure ROOT/META
get assigned as fast as possible to the first server (rather than to the best
server, as was done previously)
-- assignment could get stuck when the 'best' server was unable to contact the
master because ROOT/META was offline. Very ugly bug.
- reduce how much we retry in pending operations. Retrying can delay recovery:
if META/ROOT goes down while a TODO is being processed, the recovery of
META/ROOT has to wait until the currently running pending operation times out.
This could take over 5 minutes previously (!!). 1-second timeouts * 10 * 2-3
per commit() * 2 attempts takes a long time.
- fix a bug where, if ROOT was unavailable, some pending operations might
fail and not get requeued.
- Handle bugs where a server would go offline and 'forget' to mention that ROOT
or META went offline, thus delaying reassignment. Now we force META/ROOT
offline ASAP and get them reassigned as fast as possible on clean shutdown.
- Improved unclean shutdown handling of META - instead of waiting for the ROOT
scanner to detect a bad assignment and fix it, be more proactive and queue
META for assignment once the log split is finished. This can improve META
recovery time by 5-10 seconds.
- Fixed a rare but deadly NPE in ProcessRegionOpen, improved the handling of
failed todo operations - instead of putting them back into the todo queue, put
them into the delayed queue (since the NPE is a side effect of not having ROOT
online yet).
Yes, all of these fixes are incorporated in this relatively small patch (933
lines of diff).
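
To illustrate the two queue changes listed above (a priority-ordered todo queue, and a delayed queue for failed operations), here is a minimal sketch in Java. It is not taken from the patch: the class, the priority levels, and the method names are all hypothetical, and the real master code may be organized quite differently.

```java
import java.util.Comparator;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; none of these names come from the HBase master code.
final class TodoQueueSketch {

    // Declared in serving order: earlier constants are served first.
    // The three levels are assumptions made for illustration.
    enum Priority { ROOT_META, NORMAL, HISTORIAN }

    static final class Op {
        final String name;
        final Priority priority;
        final long seq; // submission order, keeps FIFO within one priority

        Op(String name, Priority priority, long seq) {
            this.name = name;
            this.priority = priority;
            this.seq = seq;
        }
    }

    // Wrapper that parks a failed operation until its retry time arrives.
    static final class DelayedOp implements Delayed {
        final Op op;
        final long readyAtNanos;

        DelayedOp(Op op, long delayMillis) {
            this.op = op;
            this.readyAtNanos =
                System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            return unit.convert(readyAtNanos - System.nanoTime(),
                                TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                                other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    // Ordered by priority first, then by submission order.
    private final PriorityBlockingQueue<Op> todo = new PriorityBlockingQueue<>(
        16,
        Comparator.comparing((Op o) -> o.priority).thenComparingLong(o -> o.seq));
    private final DelayQueue<DelayedOp> delayed = new DelayQueue<>();
    private final AtomicLong nextSeq = new AtomicLong();

    void submit(String name, Priority priority) {
        todo.add(new Op(name, priority, nextSeq.getAndIncrement()));
    }

    // A failed operation goes to the delayed queue instead of straight back
    // into the todo queue, so it cannot spin while ROOT is still offline.
    void failed(Op op, long delayMillis) {
        delayed.add(new DelayedOp(op, delayMillis));
    }

    // Returns the highest-priority runnable operation, or null if none.
    Op next() {
        DelayedOp ready;
        while ((ready = delayed.poll()) != null) { // only expired entries
            todo.add(ready.op);
        }
        return todo.poll();
    }
}
```

With this shape, a ROOT/META assignment submitted after a backlog of historian writes is still served first, and an operation that failed because ROOT was not yet online is retried only after its delay expires rather than immediately.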
> Taking down ROOT/META regionserver can result in cluster becoming
> in-operational
> --------------------------------------------------------------------------------
>
> Key: HBASE-1457
> URL: https://issues.apache.org/jira/browse/HBASE-1457
> Project: Hadoop HBase
> Issue Type: Bug
> Affects Versions: 0.20.0
> Reporter: ryan rawson
> Assignee: ryan rawson
> Fix For: 0.20.0
>
> Attachments: HBASE-1457-v2.patch, HBASE-1457-v3.patch,
> HBASE-1457-v4.patch, HBASE-1457.patch
>
>
> When a regionserver is taken down via controlled or uncontrolled shutdown,
> the master doesn't properly reassign the root/meta regions.