[jira] [Updated] (HBASE-9750) Add retries around Action server stop/start

Enis Soztutar (JIRA) Tue, 29 Oct 2013 11:46:03 -0700

     [ https://issues.apache.org/jira/browse/HBASE-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Enis Soztutar updated HBASE-9750:
---------------------------------

    Attachment: hbase-9750_v1.patch

Here is an attempt at retrying actions from HBaseClusterManager.

It seems to do its job:
{code}
2013-10-29 11:29:02,656 INFO  [main] hbase.HBaseClusterManager(199): Executing remote command: null/bin/hbase-daemon.sh  start master , hostname:localhost
2013-10-29 11:29:03,573 WARN  [main] hbase.HBaseClusterManager(239): Executing remote command: null/bin/hbase-daemon.sh  start master , hostname:localhost failed at attempt 0. Retrying until maxAttempts: 5. Exception: stderr: bash: null/bin/hbase-daemon.sh: No such file or directory
, stdout: 
2013-10-29 11:29:03,573 INFO  [main] util.RetryCounter(155): Sleeping 1000ms before retry #0...
2013-10-29 11:29:04,577 INFO  [main] hbase.HBaseClusterManager(199): Executing remote command: null/bin/hbase-daemon.sh  start master , hostname:localhost
2013-10-29 11:29:04,695 WARN  [main] hbase.HBaseClusterManager(239): Executing remote command: null/bin/hbase-daemon.sh  start master , hostname:localhost failed at attempt 1. Retrying until maxAttempts: 5. Exception: stderr: bash: null/bin/hbase-daemon.sh: No such file or directory
, stdout: 
2013-10-29 11:29:04,695 INFO  [main] util.RetryCounter(155): Sleeping 2000ms before retry #1...
2013-10-29 11:29:06,696 INFO  [main] hbase.HBaseClusterManager(199): Executing remote command: null/bin/hbase-daemon.sh  start master , hostname:localhost
2013-10-29 11:29:06,831 WARN  [main] hbase.HBaseClusterManager(239): Executing remote command: null/bin/hbase-daemon.sh  start master , hostname:localhost failed at attempt 2. Retrying until maxAttempts: 5. Exception: stderr: bash: null/bin/hbase-daemon.sh: No such file or directory
, stdout: 
2013-10-29 11:29:06,831 INFO  [main] util.RetryCounter(155): Sleeping 4000ms before retry #2...
2013-10-29 11:29:10,832 INFO  [main] hbase.HBaseClusterManager(199): Executing remote command: null/bin/hbase-daemon.sh  start master , hostname:localhost
2013-10-29 11:29:10,922 WARN  [main] hbase.HBaseClusterManager(239): Executing remote command: null/bin/hbase-daemon.sh  start master , hostname:localhost failed at attempt 3. Retrying until maxAttempts: 5. Exception: stderr: bash: null/bin/hbase-daemon.sh: No such file or directory
, stdout: 
2013-10-29 11:29:10,922 INFO  [main] util.RetryCounter(155): Sleeping 8000ms before retry #3...
2013-10-29 11:29:18,923 INFO  [main] hbase.HBaseClusterManager(199): Executing remote command: null/bin/hbase-daemon.sh  start master , hostname:localhost
2013-10-29 11:29:19,017 WARN  [main] hbase.HBaseClusterManager(239): Executing remote command: null/bin/hbase-daemon.sh  start master , hostname:localhost failed at attempt 4. Retrying until maxAttempts: 5. Exception: stderr: bash: null/bin/hbase-daemon.sh: No such file or directory
, stdout: 
2013-10-29 11:29:19,017 INFO  [main] util.RetryCounter(155): Sleeping 16000ms before retry #4...
2013-10-29 11:29:35,018 INFO  [main] hbase.HBaseClusterManager(199): Executing remote command: null/bin/hbase-daemon.sh  start master , hostname:localhost
Exception in thread "main" org.apache.hadoop.util.Shell$ExitCodeException: stderr: bash: null/bin/hbase-daemon.sh: No such file or directory
, stdout: 
        at org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:208)
        at org.apache.hadoop.hbase.HBaseClusterManager.execWithRetries(HBaseClusterManager.java:223)
        at org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:248)
        at org.apache.hadoop.hbase.HBaseClusterManager.start(HBaseClusterManager.java:253)
        at org.apache.hadoop.hbase.HBaseClusterManager.main(HBaseClusterManager.java:282)
{code}
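
For anyone reading the log without the patch open, the behavior it shows (re-run the command on failure, double the sleep each time starting from 1000ms, and rethrow once maxAttempts retries are used up) can be sketched roughly as below. This is only an illustration and not the code in hbase-9750_v1.patch: the class RetrySketch, the Sleeper interface, and this execWithRetries signature are hypothetical names made up for the example, and it does not use the real util.RetryCounter.

```java
import java.util.concurrent.Callable;

/** Hypothetical illustration of retry with exponential backoff; not the patch itself. */
final class RetrySketch {

    /** Abstraction over sleeping, so the backoff schedule can be observed without waiting. */
    interface Sleeper {
        void sleep(long millis) throws InterruptedException;
    }

    /**
     * Runs the action; on failure sleeps baseSleepMs * 2^attempt and tries again.
     * After maxAttempts retries have been used, the last exception is rethrown.
     */
    static <T> T execWithRetries(Callable<T> action, int maxAttempts,
                                 long baseSleepMs, Sleeper sleeper) throws Exception {
        int attempt = 0;
        while (true) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry when the calling thread was asked to stop.
                throw e;
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted, surface the failure to the caller
                }
                long sleepMs = baseSleepMs << attempt; // 1x, 2x, 4x, ... the base sleep
                System.err.println("failed at attempt " + attempt
                        + ", sleeping " + sleepMs + "ms before retry: " + e.getMessage());
                sleeper.sleep(sleepMs);
                attempt++;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // An action that fails twice and then succeeds, with a short base sleep.
        String result = execWithRetries(() -> {
            if (calls[0]++ < 2) {
                throw new java.io.IOException("transient failure");
            }
            return "started";
        }, 5, 10L, Thread::sleep);
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

With maxAttempts set to 5 and a base sleep of 1000ms, an action that always fails is run six times in this sketch, with sleeps of 1000, 2000, 4000, 8000 and 16000ms in between, which matches the shape of the log above.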

> Add retries around Action server stop/start
> -------------------------------------------
>
>                 Key: HBASE-9750
>                 URL: https://issues.apache.org/jira/browse/HBASE-9750
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>            Reporter: stack
>            Assignee: Enis Soztutar
>             Fix For: 0.98.0, 0.96.1
>
>         Attachments: hbase-9750_v1.patch
>
>
> These can fail on occasion (my upping ConnectionTimeout is not enough).  Let's 
> just retry a few times rather than fail, at least for server start.  
> Losing a server makes tests run for longer and there is also the danger we 
> could lose all servers and the long-running test would then outright fail.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
