[
https://issues.apache.org/jira/browse/OOZIE-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16493556#comment-16493556
]
Andras Piros commented on OOZIE-3156:
-------------------------------------
Thanks for the contribution [~txsing]! Can you please update
{{TestSshActionExecutor}} with a new test case covering retry functionality, as
well as extend {{DG_SshActionExtension.twiki}} to document the fix?
Review comments:
* {{SSH_CONNECT_ERROR_CODE}} could be {{final}}
* {{retriesMax}} should be {{retryCount}}
* to give the connection error a real chance of not recurring, we should wait
between retries: either {{Thread#sleep()}} for some time, or use a
[*{{ScheduledThreadPoolExecutor}}*|https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ScheduledThreadPoolExecutor.html]
to schedule the next attempt without busy waiting
* the wait time should follow an exponential backoff, like in
[*{{OperationRetryHandler#handleRetry()}}*|https://github.com/apache/oozie/blob/master/core/src/main/java/org/apache/oozie/util/db/OperationRetryHandler.java#L123-L129]
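
To illustrate the last two points, here is a minimal sketch of a retry loop with a sleep and exponential backoff. It is not taken from the patch or from {{OperationRetryHandler}}: the class {{SshRetrySketch}}, the methods {{backoffMillis()}} and {{runWithRetries()}}, and the parameters {{maxRetries}}, {{initialWaitMs}} and {{maxWaitMs}} are hypothetical names, and the {{Callable}} stands in for whatever runs the ssh command and returns its exit code.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch, not the Oozie implementation.
final class SshRetrySketch {
    // ssh exits with 255 when ssh itself failed (e.g. could not connect).
    public static final int SSH_CONNECT_ERROR_CODE = 255;

    // Wait before the given retry (1-based): initialWaitMs doubled for each
    // previous retry, capped at maxWaitMs.
    public static long backoffMillis(int retry, long initialWaitMs, long maxWaitMs) {
        long wait = initialWaitMs;
        for (int i = 1; i < retry && wait < maxWaitMs; i++) {
            wait *= 2;
        }
        return Math.min(wait, maxWaitMs);
    }

    // Runs the command once, then retries up to maxRetries times while the
    // exit code signals a connection error, sleeping between attempts.
    public static int runWithRetries(Callable<Integer> sshCommand, int maxRetries,
                                     long initialWaitMs, long maxWaitMs) throws Exception {
        int retryCount = 0;
        int exitCode = sshCommand.call();
        while (exitCode == SSH_CONNECT_ERROR_CODE && retryCount < maxRetries) {
            retryCount++;
            Thread.sleep(backoffMillis(retryCount, initialWaitMs, maxWaitMs));
            exitCode = sshCommand.call();
        }
        return exitCode;
    }
}
```

If the exit code is still {{SSH_CONNECT_ERROR_CODE}} after the last retry, the caller still has to decide what status to report; the sketch only returns the last exit code.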
> SSH action status turns OK wrongly when failed to connect to host
> -----------------------------------------------------------------
>
> Key: OOZIE-3156
> URL: https://issues.apache.org/jira/browse/OOZIE-3156
> Project: Oozie
> Issue Type: Bug
> Components: action
> Affects Versions: 5.0.0
> Reporter: TIAN XING
> Assignee: TIAN XING
> Priority: Major
> Attachments: ssh-check-bug.patch
>
>
> When the {{check()}} method of {{SshActionExecutor}} is invoked, Oozie
> connects to the host over SSH and checks whether the pid of the process
> started by the SSH action still exists, using the return value of the command
> "{{ssh <host-ip> ps -p <pid>}}", to determine whether the SSH action has
> completed.
> However, we found cases where Oozie fails to connect to the host during the
> action status check (e.g., the host is under heavy load, or the network is bad).
> In such cases, the return value of the command "{{ssh <host-ip> ps -p <pid>}}"
> is 255 (ssh exits with the exit status of the remote command, or with 255 if
> an error occurred).
> According to the current logic of the {{getActionStatus()}} method in
> {{SshActionExecutor}}, the action status is then determined as OK, which may
> not be correct.
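
The distinction the issue description asks for can be sketched roughly as below. This is an illustration only, not the code of {{getActionStatus()}}: the class {{SshStatusCheckSketch}}, the enum {{CheckResult}} and the method {{classify()}} are hypothetical, and it assumes that any non-zero exit code other than 255 means the remote {{ps -p <pid>}} did not find the process.

```java
// Hypothetical sketch, not the Oozie implementation.
final class SshStatusCheckSketch {
    public enum CheckResult { RUNNING, FINISHED, CONNECT_ERROR }

    // ssh exits with 255 when ssh itself failed (e.g. could not connect).
    public static final int SSH_CONNECT_ERROR_CODE = 255;

    // Interprets the exit code of "ssh <host-ip> ps -p <pid>".
    public static CheckResult classify(int exitCode) {
        if (exitCode == 0) {
            // The remote ps found the pid: the action is still running.
            return CheckResult.RUNNING;
        }
        if (exitCode == SSH_CONNECT_ERROR_CODE) {
            // ssh could not run the remote command, so nothing is known
            // about the pid; this must not be reported as completed.
            return CheckResult.CONNECT_ERROR;
        }
        // Assumption: any other exit code comes from the remote ps not
        // finding the pid, i.e. the process has ended.
        return CheckResult.FINISHED;
    }
}
```

A caller could retry on {{CONNECT_ERROR}} and only evaluate the action outcome on {{FINISHED}}.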
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)