[
https://issues.apache.org/jira/browse/HBASE-12464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Stephen Yuan Jiang updated HBASE-12464:
---------------------------------------
Attachment: HBASE-12464.v1-2.0.patch
This V1 patch prevents the meta table region from going into the FAILED_OPEN
state when maximumAttempts is reached (note: the smallest maximumAttempts is
1). AssignmentManager::assign() keeps retrying until the meta table region is
successfully assigned.
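
To illustrate the intended behavior, here is a minimal, self-contained sketch
of a retry loop that never gives up on a meta region. It is not the code from
the attached patch: the class MetaAssignRetrySketch, the State enum and the
tryOpen callback are hypothetical stand-ins, and the sleep between attempts
and the interrupt handling of the real loop are omitted.

```java
import java.util.function.IntPredicate;

/** Hypothetical stand-in for the retry loop; NOT the actual AssignmentManager code. */
final class MetaAssignRetrySketch {
    enum State { OPEN, FAILED_OPEN }

    /**
     * Tries to open a region. tryOpen receives the 1-based overall attempt
     * number and returns true when the (simulated) region server opened it.
     */
    static State assign(boolean isMetaRegion, int maximumAttempts, IntPredicate tryOpen) {
        int totalAttempts = 0;
        for (int i = 1; i <= maximumAttempts; i++) {
            totalAttempts++;
            if (tryOpen.test(totalAttempts)) {
                return State.OPEN;
            }
            if (isMetaRegion && i == maximumAttempts) {
                // Reset to 0, not 1: the loop increment brings it back to 1,
                // so the retry budget restarts even when maximumAttempts is 1.
                i = 0;
            }
        }
        // Only non-meta regions can run out of attempts in this sketch.
        return State.FAILED_OPEN;
    }
}
```

With this shape a meta region loops until tryOpen succeeds, so a caller that
needs to stop it (for example on master shutdown) would have to add its own
exit condition.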
> meta table region assignment stuck in the FAILED_OPEN state due to region
> server not fully ready to serve
> ---------------------------------------------------------------------------------------------------------
>
> Key: HBASE-12464
> URL: https://issues.apache.org/jira/browse/HBASE-12464
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 1.0.0, 2.0.0, 0.99.1
> Reporter: Stephen Yuan Jiang
> Assignee: Stephen Yuan Jiang
> Fix For: 1.0.0, 2.0.0
>
> Attachments: HBASE-12464.v1-2.0.patch
>
> Original Estimate: 3h
> Remaining Estimate: 3h
>
> Meta table region assignment can reach the 'FAILED_OPEN' state, which
> makes the region unavailable until the target region server shuts down or
> an operator resolves it manually. This is an undesirable state for a meta
> table region.
> Here is the sequence in which this can happen (the code is in
> AssignmentManager::assign()):
> Step 1: Master detects that a region server (RS1) hosting a meta table
> region is down and changes the meta region state from 'online' to 'offline'.
> Step 2: In a loop (with a configurable maximumAttempts count; the default is
> 10 and the minimum is 1), AssignmentManager tries to find an RS to host the
> meta table region. If no RS is available, it loops forever by resetting the
> loop count (!!BUG#1 from this logic - a small bug!!):
> if (region.isMetaRegion()) {
> - try {
> - Thread.sleep(this.sleepTimeBeforeRetryingMetaAssignment);
> - // ==> BUG: if maximumAttempts is 1, then the loop will end.
> - if (i == maximumAttempts) i = 1;
> - continue;
> - } catch (InterruptedException e) {
> - ...
> - }
> Step 3: Once a new RS is found (RS2), inside the same loop as Step 2,
> AssignmentManager tries to assign the meta region to RS2 (OFFLINE, RS1 =>
> PENDING_OPEN, RS2). If opening the region on RS2 fails for some reason
> (e.g. the target RS2 is not ready to serve - ServerNotRunningYetException),
> AssignmentManager changes the state from (PENDING_OPEN, RS2) to
> (FAILED_OPEN, RS2). It then retries (and may even pick a different RS).
> The retry count is capped at maximumAttempts. Once maximumAttempts is
> reached, the meta region stays in the 'FAILED_OPEN' state until either
> (1) RS2 shuts down, which triggers region assignment again, or (2) it is
> reassigned by an operator via HBase Shell.
> According to the documentation
> ( http://hbase.apache.org/book/regions.arch.html ), this is by design -
> "17. For regions in FAILED_OPEN or FAILED_CLOSE states, the master tries to
> close them again when they are reassigned by an operator via HBase Shell.".
> However, this is bad design, especially for the meta table region (it is
> arguable that the design is good for regular tables - for this ticket, I am
> more focused on fixing the meta region availability issue).
> I propose 2 possible fixes:
> Fix#1 (band-aid change): in Step 3, just like Step 2, if the region is a meta
> table region, reset the loop count so that the loop never exits with the
> meta table region in the FAILED_OPEN state.
> Fix#2 (more involved): if a region is in the FAILED_OPEN state, we should
> provide a way to automatically trigger AssignmentManager::assign() after a
> short period of time (leaving any region in FAILED_OPEN or other states like
> 'FAILED_CLOSE' is undesirable; there should be some way to retry and
> auto-heal the region).
> I think at least for 1.0.0, Fix#1 is good enough. We can open a task-type
> JIRA for Fix#2 in a future release.
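
As a rough illustration of what Fix#2 could look like, the sketch below
periodically re-triggers assignment for regions left in FAILED_OPEN. It is
only an assumption about one possible shape, not an existing HBase API: the
class FailedOpenHealerSketch, its region-state map and the assign callback
(standing in for AssignmentManager::assign()) are all hypothetical, and
FAILED_CLOSE handling is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Hypothetical auto-heal helper; NOT part of HBase. */
final class FailedOpenHealerSketch {
    enum State { OPEN, FAILED_OPEN, FAILED_CLOSE }

    private final Map<String, State> regionStates = new ConcurrentHashMap<>();
    // Stand-in for AssignmentManager::assign(); returns true if the region opened.
    private final Predicate<String> assign;

    FailedOpenHealerSketch(Predicate<String> assign) {
        this.assign = assign;
    }

    void setState(String region, State state) {
        regionStates.put(region, state);
    }

    State getState(String region) {
        return regionStates.get(region);
    }

    /** One healing pass over all regions; returns how many were re-opened. */
    int healOnce() {
        int healed = 0;
        for (Map.Entry<String, State> e : regionStates.entrySet()) {
            if (e.getValue() == State.FAILED_OPEN && assign.test(e.getKey())) {
                // Only flip the state if it is still FAILED_OPEN.
                if (regionStates.replace(e.getKey(), State.FAILED_OPEN, State.OPEN)) {
                    healed++;
                }
            }
        }
        return healed;
    }

    /** Runs healOnce() repeatedly with a fixed delay between passes. */
    ScheduledExecutorService start(long delayMillis) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleWithFixedDelay(this::healOnce, delayMillis, delayMillis, TimeUnit.MILLISECONDS);
        return ses;
    }
}
```

A caller would invoke start() once and shut the returned executor down when
the master stops; a region that keeps failing simply stays in FAILED_OPEN
until a later pass succeeds.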