[jira] [Updated] (HBASE-8924) Master Can fail to come up after chaos monkey if the sleep time is too short.

Elliott Clark (JIRA) Wed, 10 Jul 2013 20:28:14 -0700

     [ https://issues.apache.org/jira/browse/HBASE-8924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Elliott Clark updated HBASE-8924:
---------------------------------

    Attachment: hbase-hbase-master-a1805.halxg.cloudera.com.log.gz

The attached log contains the failed restart.


Here's the log from the test trying to bring the master back up.
{code}
2013-07-10 18:02:06,423 INFO  [pool-1-thread-4] hbase.ClusterManager: Executed remote command, exit code:0 , output:
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] util.ChaosMonkey: Killed master server:a1805.halxg.cloudera.com,60000,1373500144613
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] util.ChaosMonkey: Sleeping for:0
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] util.ChaosMonkey: Starting master:a1805.halxg.cloudera.com
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] hbase.HBaseCluster: Starting Master on: a1805.halxg.cloudera.com
2013-07-10 18:02:06,424 INFO  [pool-1-thread-4] hbase.ClusterManager: Executing remote command: /opt/hbase/current/bin/../bin/hbase-daemon.sh  start master , hostname:a1805.halxg.cloudera.com
2013-07-10 18:02:06,425 INFO  [pool-1-thread-4] util.Shell: Executing full command [/usr/bin/ssh -o ConnectTimeout=1 -o StrictHostKeyChecking=no a1805.halxg.cloudera.com "/opt/hbase/current/bin/../bin/hbase-daemon.sh  start master"]
2013-07-10 18:02:06,426 WARN  [pool-1-thread-7] client.HConnectionManager$HConnectionImplementation: Checking master connection
com.google.protobuf.ServiceException: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1805.halxg.cloudera.com/10.20.200.105:60000
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1589)
        at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1630)
        at org.apache.hadoop.hbase.protobuf.generated.MasterMonitorProtos$MasterMonitorService$BlockingStub.isMasterRunning(MasterMonitorProtos.java:3021)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$MasterMonitorServiceState.isMasterRunning(HConnectionManager.java:1273)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(HConnectionManager.java:1916)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterMonitorService(HConnectionManager.java:1866)
        at org.apache.hadoop.hbase.client.HBaseAdmin.execute(HBaseAdmin.java:2682)
        at org.apache.hadoop.hbase.client.HBaseAdmin.getClusterStatus(HBaseAdmin.java:1945)
        at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$AdminCallable.doAction(IntegrationTestMTTR.java:470)
        at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$TimingCallable.call(IntegrationTestMTTR.java:370)
        at org.apache.hadoop.hbase.mttr.IntegrationTestMTTR$TimingCallable.call(IntegrationTestMTTR.java:353)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.hadoop.hbase.ipc.RpcClient$FailedServerException: This server is in the failed servers list: a1805.halxg.cloudera.com/10.20.200.105:60000
        at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupIOstreams(RpcClient.java:828)
        at org.apache.hadoop.hbase.ipc.RpcClient.getConnection(RpcClient.java:1455)
        at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1347)
        at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1573)
        ... 15 more
{code}
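
Note the "Sleeping for:0" line in the log above: the start command is issued in the same millisecond the kill returns. One possible way to guard against that in a test is sketched below. It is only an illustration, not the fix applied for this issue and not existing HBase or ChaosMonkey code: the class RestartGuard and its methods are hypothetical names. It assumes the failure comes from the old master not having fully gone away when the start command runs, which the log alone does not prove.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper; not part of HBase or ChaosMonkey. */
final class RestartGuard {
    private RestartGuard() {}

    /** Applies a floor to the configured sleep between kill and start. */
    static long effectiveSleepMs(long requestedMs, long minimumMs) {
        return Math.max(requestedMs, minimumMs);
    }

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition held, false on timeout or interrupt.
     */
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

A caller would pass its own "old master is gone" check as the condition (for example, a probe that the master port no longer accepts connections) and only issue the start command once waitUntil returns true. The floor in effectiveSleepMs is the cruder alternative when no such check is available.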
                
> Master Can fail to come up after chaos monkey if the sleep time is too short.
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-8924
>                 URL: https://issues.apache.org/jira/browse/HBASE-8924
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: Elliott Clark
>            Assignee: Elliott Clark
>         Attachments: hbase-hbase-master-a1805.halxg.cloudera.com.log.gz
>
>
> On a real cluster the master won't come up if the sleep time between killing 
> and starting is too short.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
