rmatharu commented on a change in pull request #1104: SAMZA-2266: Introduce a
backoff when there are repeated failures for host-affinity allocations
URL: https://github.com/apache/samza/pull/1104#discussion_r305993360
##########
File path:
samza-core/src/main/java/org/apache/samza/clustermanager/ContainerProcessManager.java
##########
@@ -366,86 +381,15 @@ public void onResourceCompleted(SamzaResourceStatus resourceStatus) {
state.jobHealthy.set(false);
// handle container stop due to node fail
- this.handleContainerStop(processorId, resourceStatus.getContainerId(), ResourceRequestState.ANY_HOST, exitStatus);
+ handleContainerStop(processorId, resourceStatus.getContainerId(), ResourceRequestState.ANY_HOST, exitStatus, Duration.ZERO);
break;
default:
- log.info("Container ID: {} for Processor ID: {} failed with exit code:
{}.", containerId, processorId, exitStatus);
-
- state.failedContainers.incrementAndGet();
- state.failedContainersStatus.put(containerId, resourceStatus);
- state.jobHealthy.set(false);
-
- state.neededProcessors.incrementAndGet();
- // Find out previously running container location
- String lastSeenOn = state.jobModelManager.jobModel().getContainerToHostValue(processorId, SetContainerHostMapping.HOST_KEY);
- if (!hostAffinityEnabled || lastSeenOn == null) {
- lastSeenOn = ResourceRequestState.ANY_HOST;
- }
- log.info("Container ID: {} for Processor ID: {} was last seen on host
{}.", containerId, processorId, lastSeenOn);
- // A container failed for an unknown reason. Let's check to see if
- // we need to shutdown the whole app master if too many container
- // failures have happened. The rules for failing are that the
- // failure count for a task group id must be > the configured retry
- // count, and the last failure (the one prior to this one) must have
- // happened less than retry window ms ago. If retry count is set to
- // 0, the app master will fail on any container failure. If the
- // retry count is set to a number < 0, a container failure will
- // never trigger an app master failure.
- int retryCount = clusterManagerConfig.getContainerRetryCount();
- int retryWindowMs = clusterManagerConfig.getContainerRetryWindowMs();
-
- if (retryCount == 0) {
- log.error("Processor ID: {} (current Container ID: {}) failed, and retry count is set to 0, " +
- "so shutting down the application master and marking the job as failed.", processorId, containerId);
-
- tooManyFailedContainers = true;
- } else if (retryCount > 0) {
- int currentFailCount;
- long lastFailureTime;
- if (processorFailures.containsKey(processorId)) {
- ProcessorFailure failure = processorFailures.get(processorId);
- currentFailCount = failure.getCount() + 1;
- lastFailureTime = failure.getLastFailure();
- } else {
- currentFailCount = 1;
- lastFailureTime = 0L;
- }
- if (currentFailCount >= retryCount) {
- long lastFailureMsDiff = System.currentTimeMillis() - lastFailureTime;
Review comment:
Summarizing the discussion for anyone looking at this review: there was a bug where the "fail the job if there are N failures within a window of M ms" check only verified that the most recent failure fell within the M-ms window, rather than that all N failures did.