[ 
https://issues.apache.org/jira/browse/HBASE-5202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186609#comment-13186609
 ] 

Eugene Koontz commented on HBASE-5202:
--------------------------------------

Jinchao wrote (in 
[HBASE-3933|https://issues.apache.org/jira/browse/HBASE-3933?focusedCommentId=13186098&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13186098]):

{quote}
@Eugene
In your patches, You only deale with the root/meta regionserver. If a normal 
regionserver registers laterly.
Master will process it as a dead one. Some regions in the later one will be 
opened twice.
{quote}

Jinchao, can you explain this scenario more? Does my patch cause duplicate 
openings that could not happen before? Or are you saying that this patch does 
not fix the existing NPE described on HBASE-3933?
                
> NPE during Master failover in master.AssignmentManager.regionOnline()
> ---------------------------------------------------------------------
>
>                 Key: HBASE-5202
>                 URL: https://issues.apache.org/jira/browse/HBASE-5202
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.6
>            Reporter: Eugene Koontz
>            Assignee: Eugene Koontz
>         Attachments: HBASE-5202.patch, testMasterFailoverWithSlowRS.txt
>
>
> The following NPE can occur during master failover:
> {code}
> 2012-01-15 17:45:00,314 FATAL 
> [Master:1;ip-10-166-123-193.us-west-1.compute.internal:36708] 
> master.HMaster(944): Unhandled exception. Starting shutdown.
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:724)
>         at 
> org.apache.hadoop.hbase.master.AssignmentManager.processFailover(AssignmentManager.java:214)
>         at 
> org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:396)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:279)
>         at java.lang.Thread.run(Thread.java:636)
> {code}
> This is caused by regionOnline() being passed a null serverInfo (its second 
> parameter). 
> The AssignmentManager's processFailover() method passes null to 
> regionOnline() because the value that processFailover() passes, hsi, is set as:
> {code}
> hsi = 
> this.serverManager.getHServerInfo(this.catalogTracker.getMetaLocation());
> {code}
> and
>  
> {code}
> hsi = 
> this.serverManager.getHServerInfo(this.catalogTracker.getRootLocation());
> {code}
> getHServerInfo() is defined as:
> {code}
>   public HServerInfo getHServerInfo(final HServerAddress hsa) {
>     synchronized(this.onlineServers) {
>       // TODO: This is primitive.  Do a better search.
>       for (Map.Entry<String, HServerInfo> e: this.onlineServers.entrySet()) {
>         if (e.getValue().getServerAddress().equals(hsa)) {
>           return e.getValue();
>         }
>       }
>     }
>     return null;
>   }
> {code}
> This will return null if the onlineServers map does not yet contain an entry 
> whose server address equals the address returned by the catalogTracker's 
> getRootLocation() or getMetaLocation(). 
> Since the catalogTracker uses zookeeper to establish the server locations of 
> {{-ROOT-}} and {{.META.}}, while the onlineServers map is populated as 
> these servers register with the master, there can be an inconsistency 
> between the catalogTracker and the onlineServers if either of these 
> regionservers is online with respect to zookeeper, but hasn't yet registered 
> with the master (perhaps due to a high latency network between the master and 
> the regionserver).
> The attached testMasterFailoverWithSlowRS.txt patch can be used to modify 
> TestMasterFailover to cause this NPE. 
> The proposed fix (provided along with the above test in a separate 
> attachment) is for the master to use the new verifyMetaTablesAreUp() to wait 
> for both of the servers named by the catalog tracker's getRootLocation() and 
> getMetaLocation() to register with the master before the master can continue 
> with failover.
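
To make the proposed wait concrete, here is a minimal, self-contained sketch of the idea. It is not the code in the attached patch and not the implementation of verifyMetaTablesAreUp(): the class and method names below (RegistrationWaitSketch, register, findServerByAddress, waitForRegistration) are hypothetical, and plain strings stand in for HServerInfo and HServerAddress. It only models the lookup that returns null for an unregistered server, plus a bounded poll that waits for that server to register.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the race described above; none of these names are HBase APIs.
final class RegistrationWaitSketch {
    // server name -> "host:port" address, standing in for the onlineServers map
    private final Map<String, String> onlineServers = new ConcurrentHashMap<>();

    // Called when a regionserver registers with the (modelled) master.
    void register(String serverName, String address) {
        onlineServers.put(serverName, address);
    }

    // Same shape as getHServerInfo(): linear search by address, null if not registered yet.
    String findServerByAddress(String address) {
        for (Map.Entry<String, String> e : onlineServers.entrySet()) {
            if (e.getValue().equals(address)) {
                return e.getKey();
            }
        }
        return null;
    }

    // Polls until a server with the given address has registered or the timeout elapses.
    // Returns the server name, or null on timeout or interruption, so callers must still
    // check for null before using the result.
    String waitForRegistration(String address, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            String name = findServerByAddress(address);
            if (name != null) {
                return name;
            }
            if (System.nanoTime() >= deadline) {
                return null;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return null;
            }
        }
    }
}
```

In this model, failover would call waitForRegistration() with the address reported by zookeeper for each catalog region and proceed only on a non-null result. Whether the real fix should bound the wait with a timeout, and what the master should do when it expires, is a design question this sketch does not settle.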

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira