Master failing when node disconnects or dies
--------------------------------------------

                 Key: HBASE-3442
                 URL: https://issues.apache.org/jira/browse/HBASE-3442
             Project: HBase
          Issue Type: Bug
          Components: master, regionserver
    Affects Versions: 0.90.0
         Environment: CentOS 5, HBase 0.90 RC3, Amazon EC2
            Reporter: Justin
            Priority: Minor


We run our servers on Amazon EC2, and nodes go through some shutdown scripts 
if/when we want to take them out of the mix.  We shut down one of the nodes, 
in this case Node98, which caused the immediate crash of the master server.  
Upon restarting, the master would attempt to contact the missing node and 
then stop its startup process.  I believe the node removed itself from the 
DNS server first, then ran a stop on the datanode and regionserver.  The 
missing node was also removed from any slave/regionserver list on the master 
server.  I finally put a bogus entry in the /etc/hosts file for the missing 
node, pointing it back to 127.0.0.1, and the master server then marked it as 
a dead node, ignored it, and finished the startup process.
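
For reference, the workaround entry would look roughly like the line below.  
This is only a sketch: the hostname is a placeholder, since the report gives 
just the node's short name (Node98), so substitute whatever name the master 
was actually trying to resolve.

```
# /etc/hosts on the master server (workaround entry)
# "node98" is a placeholder hostname, not taken from the original logs
127.0.0.1   node98
```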

I'm going to try to replicate it and save more logs.  The following log is 
the only thing I saved from the first occurrence; it shows the master failing 
to start up while checking for the missing node:  http://pastebin.com/ZyQMQm91

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
