[ https://issues.apache.org/jira/browse/HBASE-9563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792053#comment-13792053 ]

Enis Soztutar commented on HBASE-9563:
--------------------------------------

Today, we also saw a related issue in one of the tests.

CM killed the active master, but was not able to start it back up:
{code}
2013-10-09 11:27:11,911|beaver.machine|INFO|2013-10-09 11:27:11,910 INFO  
[Thread-6] actions.Action: Performing action: Restart active master
2013-10-09 11:27:11,918|beaver.machine|INFO|2013-10-09 11:27:11,918 INFO  
[Thread-6] actions.Action: Killing 
master:ambari-unsec-1381298460-hbase-1.cs1cloud.internal,60000,1381340537392
2013-10-09 11:27:11,919|beaver.machine|INFO|2013-10-09 11:27:11,918 INFO  
[Thread-6] hbase.HBaseCluster: Aborting Master: 
ambari-unsec-1381298460-hbase-1.cs1cloud.internal,60000,1381340537392
2013-10-09 11:27:11,919|beaver.machine|INFO|2013-10-09 11:27:11,918 INFO  
[Thread-6] hbase.ClusterManager: Executing remote command: ps aux | grep 
proc_master | grep -v grep | tr -s ' ' | cut -d ' ' -f2 | xargs kill -s SIGKILL 
, hostname:ambari-unsec-1381298460-hbase-1.cs1cloud.internal
2013-10-09 11:27:11,919|beaver.machine|INFO|2013-10-09 11:27:11,918 INFO  
[Thread-6] util.Shell: Executing full command [/usr/bin/ssh  -o 
StrictHostKeyChecking=no ambari-unsec-1381298460-hbase-1.cs1cloud.internal 
"sudo su - hbase -c \"ps aux | grep proc_master | grep -v grep | tr -s ' ' | 
cut -d ' ' -f2 | xargs kill -s SIGKILL\""]
2013-10-09 11:27:16,684|beaver.machine|INFO|2013-10-09 11:27:16,683 INFO  
[Thread-6] hbase.ClusterManager: Executed remote command, exit code:0 , output:
2013-10-09 11:27:16,685|beaver.machine|INFO|2013-10-09 11:27:16,683 INFO  
[Thread-6] hbase.HBaseCluster: Waiting service:master to stop: 
ambari-unsec-1381298460-hbase-1.cs1cloud.internal,60000,1381340537392
2013-10-09 11:27:16,685|beaver.machine|INFO|2013-10-09 11:27:16,683 INFO  
[Thread-6] hbase.ClusterManager: Executing remote command: ps aux | grep 
proc_master | grep -v grep | tr -s ' ' | cut -d ' ' -f2 , 
hostname:ambari-unsec-1381298460-hbase-1.cs1cloud.internal
2013-10-09 11:27:16,685|beaver.machine|INFO|2013-10-09 11:27:16,684 INFO  
[Thread-6] util.Shell: Executing full command [/usr/bin/ssh  -o 
StrictHostKeyChecking=no ambari-unsec-1381298460-hbase-1.cs1cloud.internal 
"sudo su - hbase -c \"ps aux | grep proc_master | grep -v grep | tr -s ' ' | 
cut -d ' ' -f2\""]
2013-10-09 11:27:17,183|beaver.machine|INFO|2013-10-09 11:27:17,183 INFO  
[Thread-6] hbase.ClusterManager: Executed remote command, exit code:0 , output:
2013-10-09 11:27:17,184|beaver.machine|INFO|2013-10-09 11:27:17,183 INFO  
[Thread-6] actions.Action: Killed master 
server:ambari-unsec-1381298460-hbase-1.cs1cloud.internal,60000,1381340537392
2013-10-09 11:27:17,184|beaver.machine|INFO|2013-10-09 11:27:17,183 INFO  
[Thread-6] actions.Action: Sleeping for:5000

2013-10-09 11:27:22,184|beaver.machine|INFO|2013-10-09 11:27:22,183 INFO  
[Thread-6] actions.Action: Starting 
master:ambari-unsec-1381298460-hbase-1.cs1cloud.internal
2013-10-09 11:27:22,184|beaver.machine|INFO|2013-10-09 11:27:22,183 INFO  
[Thread-6] hbase.HBaseCluster: Starting Master on: 
ambari-unsec-1381298460-hbase-1.cs1cloud.internal
2013-10-09 11:27:22,185|beaver.machine|INFO|2013-10-09 11:27:22,184 INFO  
[Thread-6] hbase.ClusterManager: Executing remote command: 
/usr/lib/hbase/bin/../bin/hbase-daemon.sh --config /tmp/hbaseConf start master 
, hostname:ambari-unsec-1381298460-hbase-1.cs1cloud.internal
2013-10-09 11:27:22,185|beaver.machine|INFO|2013-10-09 11:27:22,184 INFO  
[Thread-6] util.Shell: Executing full command [/usr/bin/ssh  -o 
StrictHostKeyChecking=no ambari-unsec-1381298460-hbase-1.cs1cloud.internal 
"sudo su - hbase -c \"/usr/lib/hbase/bin/../bin/hbase-daemon.sh --config 
/tmp/hbaseConf start master\""]
2013-10-09 11:27:23,006|beaver.machine|INFO|2013-10-09 11:27:23,006 WARN  
[Thread-6] policies.Policy: Exception occured during performing action: 
org.apache.hadoop.util.Shell$ExitCodeException:
2013-10-09 11:27:23,007|beaver.machine|INFO|at 
org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
2013-10-09 11:27:23,007|beaver.machine|INFO|at 
org.apache.hadoop.util.Shell.run(Shell.java:379)
2013-10-09 11:27:23,007|beaver.machine|INFO|at 
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
2013-10-09 11:27:23,007|beaver.machine|INFO|at 
org.apache.hadoop.hbase.HBaseClusterManager$RemoteShell.execute(HBaseClusterManager.java:111)
2013-10-09 11:27:23,007|beaver.machine|INFO|at 
org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:187)
2013-10-09 11:27:23,008|beaver.machine|INFO|at 
org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:196)
2013-10-09 11:27:23,008|beaver.machine|INFO|at 
org.apache.hadoop.hbase.HBaseClusterManager.start(HBaseClusterManager.java:201)
2013-10-09 11:27:23,008|beaver.machine|INFO|at 
org.apache.hadoop.hbase.DistributedHBaseCluster.startMaster(DistributedHBaseCluster.java:148)
2013-10-09 11:27:23,008|beaver.machine|INFO|at 
org.apache.hadoop.hbase.chaos.actions.Action.startMaster(Action.java:77)
2013-10-09 11:27:23,008|beaver.machine|INFO|at 
org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.restartMaster(RestartActionBaseAction.java:45)
2013-10-09 11:27:23,009|beaver.machine|INFO|at 
org.apache.hadoop.hbase.chaos.actions.RestartActiveMasterAction.perform(RestartActiveMasterAction.java:35)
2013-10-09 11:27:23,009|beaver.machine|INFO|at 
org.apache.hadoop.hbase.chaos.policies.PeriodicRandomActionPolicy.runOneIteration(PeriodicRandomActionPolicy.java:59)
2013-10-09 11:27:23,009|beaver.machine|INFO|at 
org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
2013-10-09 11:27:23,009|beaver.machine|INFO|at 
org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
2013-10-09 11:27:23,009|beaver.machine|INFO|at 
java.lang.Thread.run(Thread.java:722)
{code}
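
For reference, the remote kill above boils down to the following pipeline (the commands are the same as in the log; the annotations are mine). Note that it targets only the master JVM, which carries the -Dproc_master marker on its command line, and not the hbase-daemon.sh wrapper that launched it:
{code}
ps aux |
  grep proc_master |    # match the master JVM by its -Dproc_master marker
  grep -v grep |        # exclude the grep process itself from the listing
  tr -s ' ' |           # squeeze repeated spaces so cut sees one delimiter
  cut -d ' ' -f2 |      # field 2 of `ps aux` output is the PID
  xargs kill -s SIGKILL
{code}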

From the master logs you can see that the master did not start at all:
{code}
2013-10-09 11:26:33,458 INFO  [AM.ZK.Worker-pool2-t365] master.RegionStates: 
Onlined 5ec605643a410b2f38dfb5e9806b0f87 on 
ambari-unsec-1381298460-hbase-7.cs1cloud.internal,60020,1381342526428
2013-10-09 11:26:33,458 INFO  [AM.ZK.Worker-pool2-t365] master.RegionStates: 
Offlined 5ec605643a410b2f38dfb5e9806b0f87 from 
ambari-unsec-1381298460-hbase-3.cs1cloud.internal,60020,1381342212884
2013-10-09 11:26:51,906 DEBUG [RpcServer.handler=4,port=60000] master.HMaster: 
Skipping move of region 
IntegrationTestBigLinkedList,\xEF\xF0\xB0\xED\xBC\xA4\xB8\x88\xB6\x18\x8C\xD7nd%\x9B,1381339413686.61ee6a62bd598c8d3846b2348a48bee2.
 because region already assigned to the same server 
ambari-unsec-1381298460-hbase-4.cs1cloud.internal,60020,1381342368480.
2013-10-09 11:27:41,642 INFO  [main] util.VersionInfo: HBase 
0.96.0.2.0.6.0-67-hadoop2
<removed>
2013-10-09 11:27:47,950 INFO  [main] zookeeper.RecoverableZooKeeper: Process 
identifier=clean znode for master connecting to ZooKeeper 
ensemble=ambari-unsec-1381298460-hbase-3.cs1cloud.internal:2181,ambari-unsec-1381298460-hbase-1.cs1cloud.internal:2181,ambari-unsec-1381298460-hbase-4.cs1cloud.internal:2181
2013-10-09 11:27:48,117 INFO  
[main-SendThread(ambari-unsec-1381298460-hbase-1.cs1cloud.internal:2181)] 
zookeeper.ClientCnxn: Opening socket connection to server 
ambari-unsec-1381298460-hbase-1.cs1cloud.internal/172.18.145.170:2181. Will not 
attempt to authenticate using SASL (unknown error)
2013-10-09 11:27:48,241 INFO  
[main-SendThread(ambari-unsec-1381298460-hbase-1.cs1cloud.internal:2181)] 
zookeeper.ClientCnxn: Socket connection established to 
ambari-unsec-1381298460-hbase-1.cs1cloud.internal/172.18.145.170:2181, 
initiating session
2013-10-09 11:27:48,408 INFO  
[main-SendThread(ambari-unsec-1381298460-hbase-1.cs1cloud.internal:2181)] 
zookeeper.ClientCnxn: Session establishment complete on server 
ambari-unsec-1381298460-hbase-1.cs1cloud.internal/172.18.145.170:2181, 
sessionid = 0x1419d4bb0cb0138, negotiated timeout = 30000
{code}
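
One quick way to confirm from the log file that the lines after 11:27:41 belong to the clean-znode helper rather than to a restarted master (the log path here is hypothetical):
{code}
grep -n "Process identifier=clean znode for master" \
    /var/log/hbase/hbase-hbase-master-*.log
{code}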

In hbase-daemon.sh, we wait for the process to exit, then clean the znode:
{code}
(internal_start)
    # Add to the command log file vital stats on our environment.
    echo "`date` Starting $command on `hostname`" >> $loglog
    echo "`ulimit -a`" >> $loglog 2>&1
    nice -n $HBASE_NICENESS "$HBASE_HOME"/bin/hbase \
        --config "${HBASE_CONF_DIR}" \
        $command "$@" start >> "$logout" 2>&1 &
    echo $! > $pid
    # Block until the daemon process above exits, then clear its znode.
    wait
    cleanZNode
{code}
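
In that flow, a cleanZNode failure dictates the wrapper's exit status. One possible hardening in the spirit of the issue description below (a sketch only, not the committed patch) is to log a failed cleanup instead of letting it abort the wrapper:
{code}
    # ... identical to the snippet above through `echo $! > $pid` ...
    wait
    # Sketch: log a failed znode cleanup rather than letting it dictate
    # the wrapper's exit status and block a later (auto)restart.
    cleanZNode || echo "`date` cleanZNode failed for $command" >> $loglog
{code}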

If you compare the timestamp of the last log line from the old master with the 
first log line from the clean-znode task, the former is at 11:26:51,906 and the 
latter at 11:27:41,642, a gap of roughly 50 seconds. In the CM logs, you can 
see that CM tried to start the master at 11:27:22,184. It seems that we kill -9 
the master process, and ps no longer shows it when we connect for 
waitForMasterToStop(), but the pid indeed lingers on, so the master start on 
the next run fails. We also do not print out the process stdout in case of an 
error.
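
A stricter stop check on the CM side could close that window. Here is a hedged sketch (a hypothetical helper, not existing CM code; it assumes the wrapper shows up in ps as hbase-daemon.sh): treat the master as stopped only once both the java process and its wrapper are gone, since kill -9 removes the former from ps well before the latter finishes its wait/cleanZNode sequence.
{code}
wait_for_master_fully_stopped() {
  local host=$1
  for _ in $(seq 1 30); do
    # Look for both the master JVM (-Dproc_master) and the hbase-daemon.sh
    # wrapper; kill -9 only removes the JVM from this listing.
    if ssh -o StrictHostKeyChecking=no "$host" \
        "ps aux | grep -E 'proc_master|hbase-daemon' | grep -v grep" \
        | grep -q .; then
      sleep 2       # something master-related is still running
    else
      return 0      # both the JVM and the wrapper are gone
    fi
  done
  return 1          # timed out; report instead of starting blindly
}
{code}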

> Autorestart doesn't work if zkcleaner fails
> -------------------------------------------
>
>                 Key: HBASE-9563
>                 URL: https://issues.apache.org/jira/browse/HBASE-9563
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Elliott Clark
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.98.0, 0.96.0
>
>         Attachments: 9563.txt, 9563v2.txt, categorization.txt, 
> categorization.txt
>
>
> I've seen this several times where a master didn't autorestart because zk 
> cleaner failed.  We should still restart the daemon even if it's not possible 
> to clean the zk nodes.


