[ https://issues.apache.org/jira/browse/HBASE-9750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Enis Soztutar updated HBASE-9750:
---------------------------------
Attachment: hbase-9750_v1.patch
Here is an attempt at retrying actions from HBaseClusterManager.
It seems to do its job:
{code}
2013-10-29 11:29:02,656 INFO [main] hbase.HBaseClusterManager(199): Executing remote command: null/bin/hbase-daemon.sh start master , hostname:localhost
2013-10-29 11:29:03,573 WARN [main] hbase.HBaseClusterManager(239): Executing remote command: null/bin/hbase-daemon.sh start master , hostname:localhost failed at attempt 0. Retrying until maxAttempts: 5. Exception: stderr: bash: null/bin/hbase-daemon.sh: No such file or directory
, stdout:
2013-10-29 11:29:03,573 INFO [main] util.RetryCounter(155): Sleeping 1000ms before retry #0...
2013-10-29 11:29:04,577 INFO [main] hbase.HBaseClusterManager(199): Executing remote command: null/bin/hbase-daemon.sh start master , hostname:localhost
2013-10-29 11:29:04,695 WARN [main] hbase.HBaseClusterManager(239): Executing remote command: null/bin/hbase-daemon.sh start master , hostname:localhost failed at attempt 1. Retrying until maxAttempts: 5. Exception: stderr: bash: null/bin/hbase-daemon.sh: No such file or directory
, stdout:
2013-10-29 11:29:04,695 INFO [main] util.RetryCounter(155): Sleeping 2000ms before retry #1...
2013-10-29 11:29:06,696 INFO [main] hbase.HBaseClusterManager(199): Executing remote command: null/bin/hbase-daemon.sh start master , hostname:localhost
2013-10-29 11:29:06,831 WARN [main] hbase.HBaseClusterManager(239): Executing remote command: null/bin/hbase-daemon.sh start master , hostname:localhost failed at attempt 2. Retrying until maxAttempts: 5. Exception: stderr: bash: null/bin/hbase-daemon.sh: No such file or directory
, stdout:
2013-10-29 11:29:06,831 INFO [main] util.RetryCounter(155): Sleeping 4000ms before retry #2...
2013-10-29 11:29:10,832 INFO [main] hbase.HBaseClusterManager(199): Executing remote command: null/bin/hbase-daemon.sh start master , hostname:localhost
2013-10-29 11:29:10,922 WARN [main] hbase.HBaseClusterManager(239): Executing remote command: null/bin/hbase-daemon.sh start master , hostname:localhost failed at attempt 3. Retrying until maxAttempts: 5. Exception: stderr: bash: null/bin/hbase-daemon.sh: No such file or directory
, stdout:
2013-10-29 11:29:10,922 INFO [main] util.RetryCounter(155): Sleeping 8000ms before retry #3...
2013-10-29 11:29:18,923 INFO [main] hbase.HBaseClusterManager(199): Executing remote command: null/bin/hbase-daemon.sh start master , hostname:localhost
2013-10-29 11:29:19,017 WARN [main] hbase.HBaseClusterManager(239): Executing remote command: null/bin/hbase-daemon.sh start master , hostname:localhost failed at attempt 4. Retrying until maxAttempts: 5. Exception: stderr: bash: null/bin/hbase-daemon.sh: No such file or directory
, stdout:
2013-10-29 11:29:19,017 INFO [main] util.RetryCounter(155): Sleeping 16000ms before retry #4...
2013-10-29 11:29:35,018 INFO [main] hbase.HBaseClusterManager(199): Executing remote command: null/bin/hbase-daemon.sh start master , hostname:localhost
Exception in thread "main" org.apache.hadoop.util.Shell$ExitCodeException: stderr: bash: null/bin/hbase-daemon.sh: No such file or directory
, stdout:
	at org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:208)
	at org.apache.hadoop.hbase.HBaseClusterManager.execWithRetries(HBaseClusterManager.java:223)
	at org.apache.hadoop.hbase.HBaseClusterManager.exec(HBaseClusterManager.java:248)
	at org.apache.hadoop.hbase.HBaseClusterManager.start(HBaseClusterManager.java:253)
	at org.apache.hadoop.hbase.HBaseClusterManager.main(HBaseClusterManager.java:282)
{code}
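The behaviour visible in the log above (sleeps of 1000, 2000, 4000, 8000 and 16000 ms, with the exception surfacing on the sixth execution) can be sketched roughly as follows. This is only an illustration of the retry-with-exponential-backoff pattern, not the code in the patch: RetrySketch, callWithRetries and backoffMillis are hypothetical names, and the base sleep and the doubling factor are assumptions read off the log.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch; not taken from hbase-9750_v1.patch.
class RetrySketch {

    // Sleep before retry number `attempt` (0-based): base * 2^attempt.
    // Assumes small attempt counts, so the shift does not overflow.
    static long backoffMillis(long baseMillis, int attempt) {
        return baseMillis << attempt;
    }

    // Runs `action`, retrying up to `maxAttempts` times after the first failure.
    // The action is therefore executed at most maxAttempts + 1 times, and the
    // last exception is rethrown once the retries are used up.
    static <T> T callWithRetries(Callable<T> action, int maxAttempts, long baseMillis)
            throws Exception {
        int attempt = 0;
        while (true) {
            try {
                return action.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of retries: let the caller see the failure
                }
                long sleep = backoffMillis(baseMillis, attempt);
                System.err.println("failed at attempt " + attempt
                        + ", sleeping " + sleep + "ms before retry");
                Thread.sleep(sleep);
                attempt++;
            }
        }
    }
}
```

With maxAttempts set to 5 and a 1000 ms base, an action that always fails runs six times and spends 31 seconds sleeping in total, which matches the timestamps in the log.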
> Add retries around Action server stop/start
> -------------------------------------------
>
> Key: HBASE-9750
> URL: https://issues.apache.org/jira/browse/HBASE-9750
> Project: HBase
> Issue Type: Bug
> Components: test
> Reporter: stack
> Assignee: Enis Soztutar
> Fix For: 0.98.0, 0.96.1
>
> Attachments: hbase-9750_v1.patch
>
>
> These can fail on occasion (my upping ConnectionTimeout is not enough). Let's
> just retry a few times rather than fail, at least for server start.
> Losing a server makes tests run for longer and there is also the danger we
> could lose all servers and the long-running test would then outright fail.
--
This message was sent by Atlassian JIRA
(v6.1#6144)