Master failing when node disconnects or dies
--------------------------------------------
Key: HBASE-3442
URL: https://issues.apache.org/jira/browse/HBASE-3442
Project: HBase
Issue Type: Bug
Components: master, regionserver
Affects Versions: 0.90.0
Environment: CentOS 5, HBase 0.90 RC3, Amazon EC2
Reporter: Justin
Priority: Minor
We run our servers on Amazon EC2, and nodes go through some shutdown scripts
when we want to take them out of the mix. We shut down one of the nodes, in
this case Node98, which caused the master server to crash immediately. After
the master was restarted, it would attempt to contact the missing node and then
stop its startup process. I believe the node removed itself from the DNS server
first, then stopped the datanode and regionserver. The missing node was also
removed from any slave/regionserver list on the master server. I finally put a
bogus entry in the /etc/hosts file for the missing node, pointing it to
127.0.0.1, after which the master server marked it as a dead node, ignored it,
and finished the startup process.
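One rough way to check whether a master host is in the same situation is to
test whether the removed node's name still resolves there. The sketch below is
only an illustration: the resolves() helper and the host name "node98.invalid"
are made up for this example and are not part of HBase. It also assumes the
failure is tied to name resolution, which the /etc/hosts workaround suggests
but which has not been confirmed from logs.

```python
import socket


def resolves(hostname):
    """Return True if the local resolver (DNS or /etc/hosts) knows hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised when the name cannot be resolved to any address.
        return False


if __name__ == "__main__":
    # "node98.invalid" is a placeholder, not the real host name of the node.
    for name in ("localhost", "node98.invalid"):
        print(name, "resolves" if resolves(name) else "does not resolve")
```

If the removed node's name does not resolve on the master host, the bogus
/etc/hosts entry described above would make it resolve again.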
I am going to try to replicate it and save more logs. The following log is the
only thing I saved from the first occurrence; it shows the master failing to
start up while checking for the missing node: http://pastebin.com/ZyQMQm91