[ https://issues.apache.org/jira/browse/HBASE-3874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean-Daniel Cryans reassigned HBASE-3874:
-----------------------------------------
Assignee: Jean-Daniel Cryans
> ServerShutdownHandler fails on NPE if a plan has a random region assignment
> ---------------------------------------------------------------------------
>
> Key: HBASE-3874
> URL: https://issues.apache.org/jira/browse/HBASE-3874
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.90.2
> Reporter: Jean-Daniel Cryans
> Assignee: Jean-Daniel Cryans
> Priority: Blocker
> Fix For: 0.90.4
>
> Attachments: HBASE-3874-trunk.patch, HBASE-3874.patch
>
>
> By chance, we were able to revert the ulimit on one of our clusters to 1024
> and it started dying non-stop on "Too many open files". The bad part is that
> some region servers were never fully processed by ServerShutdownHandler,
> because it failed on:
> {quote}
> 2011-05-07 00:04:46,203 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN
> java.lang.NullPointerException
>     at org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(AssignmentManager.java:1804)
>     at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:101)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:662)
> {quote}
> Reading the code, it seems the NPE is in the if statement:
> {code}
> Map.Entry<String, RegionPlan> e = i.next();
> if (e.getValue().getDestination().equals(hsi)) {
>   // Use iterator's remove else we'll get CME
>   i.remove();
> }
> {code}
> This means that the destination (HSI) is null. Looking through the code, it
> seems we instantiate a RegionPlan with a null HSI when it's a random
> assignment.
> So if a random assignment is in progress when a node dies, this issue can
> occur.
> Initially I thought this could mean data loss, but the logs are already
> split, so it's only the reassignment that doesn't happen (still bad).
> It also left the master with a dead server still marked as being processed,
> so for two days the balancer didn't run, failing with:
> bq. org.apache.hadoop.hbase.master.HMaster: Not running balancer because processing dead regionserver(s): []
> The array is empty because we are running 0.90.3, which removes the RS from
> the dead list if it comes back.
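
For illustration only, here is a minimal, self-contained sketch of the kind of null guard the description points toward. It is not the attached patch: Plan, removePlansFor and the server names are hypothetical stand-ins (a plain String replaces the HSI), and the only assumption carried over from the report is that a random assignment leaves the destination null.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class PlanCleanupSketch {

    /** Hypothetical stand-in for RegionPlan; destination is null for a random assignment. */
    static final class Plan {
        private final String destination; // hypothetical: a server name instead of an HSI

        Plan(String destination) {
            this.destination = destination;
        }

        String getDestination() {
            return destination;
        }
    }

    /** Removes every plan whose destination is the dead server; returns how many were removed. */
    static int removePlansFor(Map<String, Plan> plans, String deadServer) {
        int removed = 0;
        Iterator<Map.Entry<String, Plan>> i = plans.entrySet().iterator();
        while (i.hasNext()) {
            Map.Entry<String, Plan> e = i.next();
            String dest = e.getValue().getDestination();
            // A null destination means "random assignment": it cannot point at the
            // dead server, so the plan is kept instead of triggering an NPE.
            if (dest != null && dest.equals(deadServer)) {
                // Use the iterator's remove to avoid ConcurrentModificationException.
                i.remove();
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, Plan> plans = new HashMap<>();
        plans.put("region-a", new Plan("rs1"));
        plans.put("region-b", new Plan(null)); // random assignment
        plans.put("region-c", new Plan("rs2"));

        int removed = removePlansFor(plans, "rs1");
        System.out.println(removed + " removed, " + plans.size() + " kept");
    }
}
```

In this sketch the plan with a null destination survives the cleanup, so shutdown processing for the dead server could continue instead of aborting partway through.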