[ https://issues.apache.org/jira/browse/HBASE-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793492#comment-13793492 ]
stack commented on HBASE-9750:
------------------------------
Here is the error we saw that the edit to hbase-daemon.sh is supposed to address:
{code}
2013-10-11 13:46:28,240 INFO [Thread-6] hbase.HBaseCluster: Starting RS on: a1806.halxg.cloudera.com
2013-10-11 13:46:28,240 INFO [Thread-6] hbase.ClusterManager: Executing remote command: /opt/hbase/current/bin/../bin/hbase-daemon.sh start regionserver , hostname:a1806.halxg.cloudera.com
2013-10-11 13:46:28,240 INFO [Thread-6] util.Shell: Executing full command [/usr/bin/ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no a1806.halxg.cloudera.com "/opt/hbase/current/bin/../bin/hbase-daemon.sh start regionserver"]
2013-10-11 13:46:30,154 WARN [Thread-6] policies.Policy: Exception occured during performing action: org.apache.hadoop.util.Shell$ExitCodeException: head: cannot open `/opt/hbase/current/bin/../logs/hbase-hbase-regionserver-a1806.halxg.cloudera.com.out' for reading: No such file or directory
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:458)
	at org.apache.hadoop.util.Shell.run(Shell.java:373)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:578)
	at org.apache.hadoop.hbase.HBaseClusterManager$RemoteShell.execute(HBaseClusterManager.java:111)
	at org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:187)
	at org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:196)
	at org.apache.hadoop.hbase.HBaseClusterManager.start(HBaseClusterManager.java:201)
	at org.apache.hadoop.hbase.DistributedHBaseCluster.startRegionServer(DistributedHBaseCluster.java:104)
	at org.apache.hadoop.hbase.chaos.actions.BatchRestartRsAction.perform(BatchRestartRsAction.java:60)
	at org.apache.hadoop.hbase.chaos.policies.PeriodicRandomActionPolicy.runOneIteration(PeriodicRandomActionPolicy.java:59)
	at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
	at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
	at java.lang.Thread.run(Thread.java:724)
{code}
> Add retries around Action server stop/start
> -------------------------------------------
>
> Key: HBASE-9750
> URL: https://issues.apache.org/jira/browse/HBASE-9750
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Assignee: Enis Soztutar
>
> These can fail on occasion (raising ConnectionTimeout was not enough). Let's
> retry a few times rather than fail immediately, at least for server start.
> Losing a server makes tests run longer, and there is also a danger that we
> lose all servers, at which point the long-running test fails outright.
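A minimal sketch of the retry the description asks for, in Java since that is what the chaos actions are written in. `RetryingAction.retry`, its `maxAttempts` and `sleepMs` parameters, and the linear backoff are hypothetical illustrations, not HBase's API or the committed fix; the `Callable` stands in for whatever start/stop call an Action would make (for example the `startRegionServer` path in the trace above).

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of HBase: retries a flaky action a bounded number of times.
public final class RetryingAction {

  private RetryingAction() {}

  // Runs action up to maxAttempts times, sleeping sleepMs * attempt between tries
  // (linear backoff). Rethrows the last failure if every attempt fails.
  public static <T> T retry(Callable<T> action, int maxAttempts, long sleepMs) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        last = e;
        if (attempt < maxAttempts) {
          Thread.sleep(sleepMs * attempt);
        }
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated server start that fails twice before succeeding.
    String result = retry(() -> {
      calls[0]++;
      if (calls[0] < 3) {
        throw new java.io.IOException("transient failure " + calls[0]);
      }
      return "started";
    }, 5, 10);
    System.out.println(result + " after " + calls[0] + " attempts"); // prints "started after 3 attempts"
  }
}
```

If every attempt fails, the last exception is rethrown, so the Action still reports the failure instead of silently losing the server.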
--
This message was sent by Atlassian JIRA
(v6.1#6144)