[jira] [Assigned] (HBASE-3874) ServerShutdownHandler fails on NPE if a plan has a random region assignment

Jean-Daniel Cryans (JIRA) Wed, 25 May 2011 13:40:30 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jean-Daniel Cryans reassigned HBASE-3874:
-----------------------------------------

    Assignee: Jean-Daniel Cryans

> ServerShutdownHandler fails on NPE if a plan has a random region assignment
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-3874
>                 URL: https://issues.apache.org/jira/browse/HBASE-3874
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.90.2
>            Reporter: Jean-Daniel Cryans
>            Assignee: Jean-Daniel Cryans
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: HBASE-3874-trunk.patch, HBASE-3874.patch
>
>
> By accident, the ulimit on one of our clusters was reverted to 1024 and it
> started dying non-stop on "Too many open files". The bad part is that
> ServerShutdownHandler did not finish processing some of the region servers,
> because it failed on:
> {quote}
> 2011-05-07 00:04:46,203 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
> java.lang.NullPointerException
>       at org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(AssignmentManager.java:1804)
>       at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:101)
>       at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>       at java.lang.Thread.run(Thread.java:662)
> {quote}
> Reading the code, it seems the NPE is in the if statement:
> {code}
> Map.Entry<String, RegionPlan> e = i.next();
> if (e.getValue().getDestination().equals(hsi)) {
>   // Use iterator's remove else we'll get CME
>   i.remove();
> }
> {code}
> This means that the destination (HSI) is null. Looking through the code, it
> seems we instantiate a RegionPlan with a null HSI when it's a random
> assignment, so this issue can happen whenever a random assignment is in
> progress while a node dies.
> Initially I thought that this could mean data loss, but the logs are already 
> split so it's just the reassignment that doesn't happen (still bad).
> It also left the master with a dead server still being processed, so for two
> days the balancer didn't run, failing on:
> bq. org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): []
> The array is empty because we are running 0.90.3, which removes the RS from
> the dead list if it comes back.
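
The check quoted in the description dereferences the plan's destination without testing it for null. Below is a minimal, self-contained sketch of a null-guarded version of that loop. It is an illustration only: `Plan`, `removePlansFor`, and the String server names are hypothetical stand-ins for RegionPlan and HServerInfo, and this is not necessarily what the attached patches do.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class NullSafePlanCleanup {

    /** Hypothetical stand-in for RegionPlan; a null destination models a random assignment. */
    static final class Plan {
        private final String destination;

        Plan(String destination) {
            this.destination = destination;
        }

        String getDestination() {
            return destination;
        }
    }

    /**
     * Removes every plan whose destination is the dead server and returns how
     * many were removed. Plans with no destination are left untouched.
     */
    static int removePlansFor(Map<String, Plan> plans, String deadServer) {
        int removed = 0;
        Iterator<Map.Entry<String, Plan>> i = plans.entrySet().iterator();
        while (i.hasNext()) {
            Map.Entry<String, Plan> e = i.next();
            String destination = e.getValue().getDestination();
            // Guard against the null destination of a random assignment
            // before calling equals() on it.
            if (destination != null && destination.equals(deadServer)) {
                // Use the iterator's remove to avoid a ConcurrentModificationException.
                i.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, Plan> plans = new HashMap<>();
        plans.put("regionA", new Plan("rs1"));
        plans.put("regionB", new Plan(null)); // random assignment, no destination yet
        plans.put("regionC", new Plan("rs2"));

        int removed = removePlansFor(plans, "rs1");
        System.out.println(removed + " removed, " + plans.size() + " left");
    }
}
```

In this sketch the plan with a null destination simply stays in the map, so a random assignment that is in flight when a server dies is neither removed nor able to abort the cleanup loop.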

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
