[ https://issues.apache.org/jira/browse/TWILL-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852063#comment-15852063 ]
ASF GitHub Bot commented on TWILL-181:
--------------------------------------

Github user hsaputra commented on a diff in the pull request:

    https://github.com/apache/twill/pull/23#discussion_r99416151

    --- Diff: twill-yarn/src/main/java/org/apache/twill/internal/appmaster/RunningContainers.java ---
    @@ -477,10 +493,30 @@ void handleCompleted(YarnContainerStatus status, Multiset<String> restartRunnabl
           }
         }

    -  private boolean shouldRetry(int exitCode) {
    -    return exitCode != ContainerExitCodes.SUCCESS
    -      && exitCode != ContainerExitCodes.DISKS_FAILED
    -      && exitCode != ContainerExitCodes.INIT_FAILED;
    +  private boolean shouldRetry(String runnableName, int instanceId, int exitCode) {
    +    boolean possiblyRetry =
    +      exitCode != ContainerExitCodes.SUCCESS &&
    +      exitCode != ContainerExitCodes.DISKS_FAILED &&
    +      exitCode != ContainerExitCodes.INIT_FAILED;
    +
    +    if (possiblyRetry) {
    +      int max = getMaxRetries(runnableName);
    +      if (max == Integer.MAX_VALUE) {
    +        return true; // retry without special log msg
    +      }
    +
    +      if (getRetryCount(runnableName, instanceId) == max) {
    --- End diff --

    Since we call `getRetryCount` for this check, we might as well cache it in a local variable, so that the check and the log message use the same value for this request:
    ```
    // Read the retry count once and reuse it for both the check and the log message.
    int retryCount = getRetryCount(runnableName, instanceId);
    if (retryCount == max) {
      ...
    } else {
      LOG.info("Attempting {} of {} retries for instance {} of runnable {}.",
               retryCount + 1, max, instanceId, runnableName);
      return true;
    }
    ```

> Control the maximum number of retries for failed application starts
> -------------------------------------------------------------------
>
>                 Key: TWILL-181
>                 URL: https://issues.apache.org/jira/browse/TWILL-181
>             Project: Apache Twill
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 0.7.0-incubating
>            Reporter: Martin Serrano
>            Assignee: Martin Serrano
>             Fix For: 0.10.0
>
>
> If an application consistently exits with a non-zero code, twill will
> attempt to restart indefinitely.
> I ran into this issue and a list search
> also reveals [others|http://markmail.org/message/dehx7r6tpqgcmjh4].
> There should be a mechanism to specify the maximum number of retries until
> the application fails. Ideally by default there would be a non-infinite
> maximum.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
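The retry policy discussed in this review can be sketched in isolation. It retries only on exit codes that are not terminal, stops once a per-runnable maximum is reached, and reads the retry count once per decision, as the reviewer suggests. This is a hypothetical, self-contained Java sketch, not Twill's code. The class name `RetryPolicySketch`, the `recordRetry` helper, the `runnableName#instanceId` key format, and the exit-code values are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone sketch of a bounded retry policy for failed containers.
// All names and exit-code values are hypothetical stand-ins, not Twill's
// actual RunningContainers or ContainerExitCodes implementation.
class RetryPolicySketch {
  // Hypothetical exit codes; the real values live in Twill's ContainerExitCodes.
  static final int SUCCESS = 0;
  static final int DISKS_FAILED = -101;
  static final int INIT_FAILED = 10;

  // Per-runnable retry limit; a runnable without an entry retries forever.
  private final Map<String, Integer> maxRetries = new HashMap<>();
  // Retries already issued, keyed by "runnableName#instanceId".
  private final Map<String, Integer> retryCounts = new HashMap<>();

  private static String key(String runnableName, int instanceId) {
    return runnableName + "#" + instanceId;
  }

  void setMaxRetries(String runnableName, int max) {
    maxRetries.put(runnableName, max);
  }

  int getMaxRetries(String runnableName) {
    return maxRetries.getOrDefault(runnableName, Integer.MAX_VALUE);
  }

  int getRetryCount(String runnableName, int instanceId) {
    return retryCounts.getOrDefault(key(runnableName, instanceId), 0);
  }

  // Hypothetical hook: the restart logic calls this after it actually restarts an instance.
  void recordRetry(String runnableName, int instanceId) {
    retryCounts.merge(key(runnableName, instanceId), 1, Integer::sum);
  }

  boolean shouldRetry(String runnableName, int instanceId, int exitCode) {
    boolean possiblyRetry =
        exitCode != SUCCESS && exitCode != DISKS_FAILED && exitCode != INIT_FAILED;
    if (!possiblyRetry) {
      return false; // success or a terminal failure: never restart
    }
    int max = getMaxRetries(runnableName);
    if (max == Integer.MAX_VALUE) {
      return true; // unbounded: retry without a special log message
    }
    // Read the count once so the check and the log message agree.
    int retryCount = getRetryCount(runnableName, instanceId);
    if (retryCount >= max) {
      System.out.printf("Giving up on instance %d of runnable %s after %d retries.%n",
          instanceId, runnableName, max);
      return false;
    }
    System.out.printf("Attempting %d of %d retries for instance %d of runnable %s.%n",
        retryCount + 1, max, instanceId, runnableName);
    return true;
  }

  public static void main(String[] args) {
    RetryPolicySketch policy = new RetryPolicySketch();
    policy.setMaxRetries("worker", 2);
    int restarts = 0;
    // Simulate instance 0 of "worker" failing with a retryable exit code (1) every time.
    while (policy.shouldRetry("worker", 0, 1)) {
      policy.recordRetry("worker", 0);
      restarts++;
    }
    System.out.println("restarts = " + restarts);
  }
}
```

With a limit of 2, the simulated instance is restarted twice and the third request is refused, so `main` ends by printing `restarts = 2`. The sketch compares with `>=` rather than the `==` used in the diff. This is a defensive choice, so a count that somehow overshoots the limit still stops further retries.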