[ https://issues.apache.org/jira/browse/HBASE-2486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Koontz updated HBASE-2486:
---------------------------------
Status: Patch Available (was: Open)
Release Note:
Adds a new property, "hbase.master.sanitychecking", which determines how the
master handles the case where it believes a region is hosted by a particular
regionserver, but that regionserver indicates, by throwing a 'No Such Region'
exception, that it does not serve that region:
  lax      - mark the region as unassigned and continue
  paranoid - shut down the master
Note that this patch is against 0.20.3; I can create another patch against
0.20.4 and/or other tags if the reviewer would like.
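
As a rough illustration of the two modes described above, here is a minimal
sketch of reading the property and choosing an action. It is not taken from
the patch: the class, enum, and method names are hypothetical, a plain
java.util.Properties stands in for the HBase configuration object, and the
"lax" default for an unset property is an assumption.

```java
import java.util.Locale;
import java.util.Properties;

public class SanityCheckingSketch {
    /** The two values named in the release note. */
    enum Mode { LAX, PARANOID }

    /** What the master would do next; these names are illustrative only. */
    enum Action { MARK_UNASSIGNED, SHUT_DOWN_MASTER }

    /** Property name as given in the release note. */
    static final String KEY = "hbase.master.sanitychecking";

    /** Parses the configured mode. ASSUMPTION: unset means "lax". */
    static Mode readMode(Properties conf) {
        String raw = conf.getProperty(KEY, "lax");
        // valueOf throws IllegalArgumentException on an unknown value,
        // so a typo in the config fails fast instead of being ignored.
        return Mode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
    }

    /** Maps the mode to the behaviour described in the release note. */
    static Action onNoSuchRegion(Mode mode) {
        return mode == Mode.PARANOID
                ? Action.SHUT_DOWN_MASTER
                : Action.MARK_UNASSIGNED;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(KEY, "paranoid");
        System.out.println(onNoSuchRegion(readMode(conf)));
    }
}
```

Rejecting unknown values outright is a design choice made for this sketch; a
real implementation might prefer to log a warning and fall back to a default.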
> Add simple "anti-entropy" for region assignment
> -----------------------------------------------
>
> Key: HBASE-2486
> URL: https://issues.apache.org/jira/browse/HBASE-2486
> Project: Hadoop HBase
> Issue Type: Improvement
> Components: master, regionserver
> Affects Versions: 0.20.5
> Reporter: Todd Lipcon
> Assignee: Eugene Koontz
> Fix For: 0.21.0
>
>
> We've seen a number of bugs where a region server thinks it should not be
> serving a region, but the master and META think it should be. I'd like to
> propose a very simple way of fixing this issue:
> 1) whenever a regionserver throws a NotServingRegionException, it also marks
> that region id in an RS-wide Set
> 2) when a regionserver sends a heartbeat, include a message for each of these
> regions, MSG_REPORT_NSRE or somesuch, and then clear the set
> 3) when the master receives MSG_REPORT_NSRE, it does the following checks:
> a) if the region is assigned elsewhere according to META, the NSRE was due to
> a stale client, ignore
> b) if the region is in transition, ignore
> c) otherwise, we have an inconsistency, and we should take some steps to
> resolve it (e.g. mark the region unassigned, or exit the master if we are in
> "paranoid mode")
> Whatever we do, we need to make sure that this is loudly logged, and causes
> unit tests to fail, when it's detected. This should *not* happen, but when it
> does, it would be good to recover without addtable.rb, etc.
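
The three steps proposed in the issue description can be sketched roughly as
follows. This is a standalone model, not HBase code: every class, field, and
method name here is hypothetical, region and server identifiers are plain
strings, and META and the in-transition list are modelled as in-memory
collections. Treating a region with no META entry as inconsistent is an
assumption drawn from the "otherwise" case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NsreAntiEntropySketch {
    /** Possible outcomes of the master-side check (step 3). */
    enum Verdict { IGNORE_STALE_CLIENT, IGNORE_IN_TRANSITION, INCONSISTENT }

    /** Regionserver side: steps 1 and 2. */
    static class RegionServerSide {
        private final Set<String> nsreRegions = new HashSet<>();

        /** Step 1: remember every region we threw an NSRE for. */
        synchronized void recordNsre(String regionId) {
            nsreRegions.add(regionId);
        }

        /** Step 2: hand the set to the heartbeat, then clear it. */
        synchronized List<String> drainForHeartbeat() {
            List<String> reported = new ArrayList<>(nsreRegions);
            nsreRegions.clear();
            return reported;
        }
    }

    /** Master side: step 3. */
    static class MasterSide {
        /** Stand-in for META: region id -> server META says hosts it. */
        final Map<String, String> metaAssignments = new HashMap<>();
        /** Stand-in for the master's regions-in-transition list. */
        final Set<String> inTransition = new HashSet<>();

        Verdict check(String reportingServer, String regionId) {
            String owner = metaAssignments.get(regionId);
            // (a) META places the region elsewhere: a stale client asked.
            if (owner != null && !owner.equals(reportingServer)) {
                return Verdict.IGNORE_STALE_CLIENT;
            }
            // (b) Region is mid-move; the mismatch is expected.
            if (inTransition.contains(regionId)) {
                return Verdict.IGNORE_IN_TRANSITION;
            }
            // (c) Master and regionserver disagree: log loudly, then
            // unassign the region or exit, depending on the mode.
            return Verdict.INCONSISTENT;
        }
    }

    public static void main(String[] args) {
        RegionServerSide rs = new RegionServerSide();
        rs.recordNsre("region-1");

        MasterSide master = new MasterSide();
        master.metaAssignments.put("region-1", "rs-A");

        for (String regionId : rs.drainForHeartbeat()) {
            System.out.println(regionId + " -> " + master.check("rs-A", regionId));
        }
    }
}
```

Using a Set on the regionserver means repeated NSREs for the same region
between two heartbeats collapse into a single report, which keeps the
heartbeat small.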