liyude-tw commented on code in PR #26809:
URL: https://github.com/apache/flink/pull/26809#discussion_r2217918195


##########
flink-yarn/src/main/java/org/apache/flink/yarn/configuration/YarnConfigOptions.java:
##########
@@ -110,16 +110,16 @@ public class YarnConfigOptions {
     public static final ConfigOption<Long> APPLICATION_ATTEMPT_FAILURE_VALIDITY_INTERVAL =
             key("yarn.application-attempt-failures-validity-interval")
                     .longType()
-                    .defaultValue(10000L)
+                    .defaultValue(-1L)

Review Comment:
   Below is the reasoning that led me to propose -1, and why I believe the
   change is safer and less surprising than the current default.
   
   1. Few users intentionally depend on the current 10 s window
   The 10 s sliding window was introduced in PR #8400 by reusing the
   then-default Akka timeout. It wasn’t added to satisfy a concrete production
   need, so I think almost no one relies on it on purpose. Most users discover
   it only after being surprised by extra restarts.
   
   2. Hadoop YARN’s own default is -1 (global counting)
   Because Flink runs as a YARN ApplicationMaster, aligning with the upstream 
default reduces the cognitive overhead for operators who administer both 
systems.
   
   3. The documentation and common intuition both imply “global counting”
   The description of yarn.application-attempts naturally suggests a total 
attempt limit. A hidden time window can therefore be surprising.
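
   To make the difference between the two modes concrete, here is a minimal
   sketch of the counting rule. It is a simplified model, not YARN's actual
   implementation: the class `AttemptWindowSketch`, its method names, and the
   sample timestamps are all hypothetical. It assumes that only failures inside
   the validity interval count toward the attempt limit, and that a
   non-positive interval counts every failure.

```java
import java.util.List;

// Hypothetical, simplified model of attempt-failure counting (not YARN code).
class AttemptWindowSketch {

    /** Number of failures that still count against the limit at time nowMs. */
    static long countedFailures(List<Long> failureTimesMs, long nowMs, long validityIntervalMs) {
        if (validityIntervalMs <= 0) {
            // Assumed meaning of -1: global counting, every failure counts.
            return failureTimesMs.size();
        }
        // Sliding window: only failures within the interval count.
        return failureTimesMs.stream()
                .filter(t -> nowMs - t <= validityIntervalMs)
                .count();
    }

    /** True if the counted failures have reached the attempt limit. */
    static boolean wouldGiveUp(
            List<Long> failureTimesMs, long nowMs, long validityIntervalMs, int maxAttempts) {
        return countedFailures(failureTimesMs, nowMs, validityIntervalMs) >= maxAttempts;
    }

    public static void main(String[] args) {
        // Illustrative data: one failure per minute, limit of two attempts.
        List<Long> failures = List.of(0L, 60_000L, 120_000L);
        long now = 120_000L;
        int maxAttempts = 2;

        System.out.println(wouldGiveUp(failures, now, 10_000L, maxAttempts)); // false
        System.out.println(wouldGiveUp(failures, now, -1L, maxAttempts));     // true
    }
}
```

   With one failure per minute, the 10 s window never sees more than one
   failure at a time, so the limit of two is never reached and the application
   keeps restarting. With -1 the third failure exceeds the limit, which matches
   the "total attempt limit" reading.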
   
   ### Risk-mitigation proposal
   
   1. Upgrade guide
   Add the following note in the upgrade section for this release:
   
   > Starting with this release,
   > `yarn.application-attempt-failures-validity-interval` defaults to -1
   > (global counting).
   > Clusters that benefit from the previous 10 s sliding window can retain the
   > old behaviour by adding
   > `yarn.application-attempt-failures-validity-interval: 10000`
   
   2. Release notes
   Repeat the same notice and example so that operators can quickly restore the 
former setting if needed.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@flink.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
