HuangZhenQiu commented on a change in pull request #8952:
URL: https://github.com/apache/flink/pull/8952#discussion_r550421330
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java
##########
@@ -213,14 +235,33 @@ public void
onPreviousAttemptWorkersRecovered(Collection<WorkerType> recoveredWo
}
}
+ /**
+ * Record failure number of worker in ResourceManagers. Return whether
maximum failure rate is
+ * detected.
+ *
+ * @return whether should acquire new container/worker after the a stop
interval
+ */
+ public boolean recordWorkerFailure() {
+ failureRater.markEvent();
+
+ try {
+ failureRater.checkAgainstThreshold();
+ } catch (ThresholdExceedException e) {
+ log.warn(e.getMessage() + " in resource manager failure rater.");
+ return true;
+ }
+
+ return false;
+ }
+
@Override
public void onWorkerTerminated(ResourceID resourceId, String diagnostics) {
if (clearStateForWorker(resourceId)) {
log.info(
"Worker {} is terminated. Diagnostics: {}",
resourceId.getStringWithMetadata(),
diagnostics);
- requestWorkerIfRequired();
+ recordWorkerFailureAndPauseWorkerCreationIfNeeded();
Review comment:
Yes, we should record failure only for
currentAttemptUnregisteredWorkers.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]