[ https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Purtell updated HBASE-5202:
----------------------------------
Resolution: Incomplete
Assignee: (was: Eugene Koontz)
Status: Resolved (was: Patch Available)
> NPE during Master failover in master.AssignmentManager.regionOnline()
> ---------------------------------------------------------------------
>
> Key: HBASE-5202
> URL: https://issues.apache.org/jira/browse/HBASE-5202
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.90.6
> Reporter: Eugene Koontz
> Attachments: HBASE-5202.patch, testMasterFailoverWithSlowRS.txt
>
>
> The following NPE can occur during master failover:
> {code}
> 2012-01-15 17:45:00,314 FATAL [Master:1;ip-10-166-123-193.us-west-1.compute.internal:36708] master.HMaster(944): Unhandled exception. Starting shutdown.
> java.lang.NullPointerException
>     at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:724)
>     at org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
>     at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:396)
>     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:279)
>     at java.lang.Thread.run(Thread.java:636)
> {code}
> This is caused by regionOnline() being passed a null serverInfo (its second
> parameter).
> The AssignmentManager's processFailover() method passes a null to
> regionOnline() because the value it passes, hsi, is set as:
> {code}
> hsi = this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
> {code}
> and
>
> {code}
> hsi = this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
> {code}
> getHServerInfo() is defined as:
> {code}
> public HServerInfo getHServerInfo(final HServerAddress hsa) {
>   synchronized(this.onlineServers) {
>     // TODO: This is primitive. Do a better search.
>     for (Map.Entry<String, HServerInfo> e: this.onlineServers.entrySet()) {
>       if (e.getValue().getServerAddress().equals(hsa)) {
>         return e.getValue();
>       }
>     }
>   }
>   return null;
> }
> {code}
> This will return null if the onlineServers map does not yet contain an entry
> whose server address matches the location returned by the catalogTracker's
> getRootLocation() or getMetaLocation().
> Since the catalogTracker uses ZooKeeper to establish the server locations of
> {{-ROOT-}} and {{.META.}}, while the onlineServers map is populated as these
> servers register with the master, there can be an inconsistency between the
> catalogTracker and the onlineServers map if either of these regionservers is
> online with respect to ZooKeeper but hasn't yet registered with the master
> (perhaps due to a high-latency network between the master and the
> regionserver).
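To illustrate the window described above, here is a minimal Java sketch. It is not HBase code: OnlineServerLookupSketch, register() and findByAddress() are hypothetical names, and plain strings stand in for HServerInfo and HServerAddress. It only mirrors the linear scan in getHServerInfo() to show that a lookup made before the regionserver registers yields null.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the master's server bookkeeping; not HBase source.
class OnlineServerLookupSketch {
    // Server name -> server address, standing in for the onlineServers map.
    private final Map<String, String> onlineServers = new HashMap<String, String>();

    // Simulates a regionserver registering with the master.
    void register(String serverName, String address) {
        synchronized (onlineServers) {
            onlineServers.put(serverName, address);
        }
    }

    // Linear scan by address, as in getHServerInfo(); returns null when no
    // registered server has the given address.
    String findByAddress(String address) {
        synchronized (onlineServers) {
            for (Map.Entry<String, String> e : onlineServers.entrySet()) {
                if (e.getValue().equals(address)) {
                    return e.getKey();
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        OnlineServerLookupSketch master = new OnlineServerLookupSketch();
        // Assumed example address, as if the catalog tracker read it from ZooKeeper.
        String metaLocation = "rs1.example.com:60020";

        // The regionserver is known to ZooKeeper but has not registered yet.
        System.out.println("before registration: " + master.findByAddress(metaLocation));

        master.register("rs1", metaLocation);
        System.out.println("after registration: " + master.findByAddress(metaLocation));
    }
}
```

Any caller that dereferences the first result without a null check would fail in the same way as the stack trace above.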
> The attached testMasterFailoverWithSlowRS.txt patch can be used to modify
> TestMasterFailover to cause this NPE.
> The proposed fix (provided along with the above test in a separate
> attachment) is for the master to use the new verifyMetaTablesAreUp() to wait
> for both of the servers named by the catalog tracker's getRootLocation() and
> getMetaLocation() to register with the master before the master can continue
> with failover.
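The attached patch is not reproduced in this message, so the following is only a sketch of the general idea of waiting for registration, not the actual verifyMetaTablesAreUp(). All names here (RegistrationWaitSketch, register(), waitForRegistration()) and the timeout and poll parameters are assumptions made for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of "wait until the catalog servers have registered".
class RegistrationWaitSketch {
    // Addresses of servers that have registered with the master so far.
    private final Set<String> registered = ConcurrentHashMap.newKeySet();

    // Simulates a regionserver registering with the master.
    void register(String address) {
        registered.add(address);
    }

    // Polls until every given address has registered or the timeout elapses.
    // Returns true if all registered in time, false on timeout or interrupt.
    boolean waitForRegistration(long timeoutMs, long pollMs, String... addresses) {
        long deadline = System.nanoTime() + timeoutMs * 1000000L;
        while (true) {
            boolean allRegistered = true;
            for (String address : addresses) {
                if (!registered.contains(address)) {
                    allRegistered = false;
                    break;
                }
            }
            if (allRegistered) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException ie) {
                // Preserve the interrupt status and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In this sketch, failover processing would proceed only after waitForRegistration() returns true for the addresses reported for -ROOT- and .META.; what the master should do on a timeout is left open here.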