[ https://issues.apache.org/jira/browse/FLINK-22420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17365228#comment-17365228 ]
Yuan Mei commented on FLINK-22420:
----------------------------------

After syncing up with [~pnowojski] offline: removing the "maxNumberRestartAttempts" constraint may hide potential race conditions that the constraint would otherwise reveal, which I think is a valid point.

Here are a couple of things to follow up:

1. Double-check whether "throw exception and global failover" is a long-term solution for FLINK-21996. Option 2 in FLINK-21996 is more complicated but seems cleaner in the long term from a system design perspective.

2. If we really want to keep FLINK-21996 as it is because of other considerations, I would propose the following approach, which can be generalized to other similar test failures that rely on the "number of expected failures":
   1). Add a new type of exception. Currently, the failover exception thrown from FLINK-21996 is an "org.apache.flink.util.FlinkException: An OperatorEvent from an OperatorCoordinator...". If we want to treat this exception differently, we need a dedicated type so that future changes and amendments stay maintainable.
   2). Bump the number of allowed failures by one whenever a failure is caused by the exception from 1). Right now maxNumberRestartAttempts is a fixed preset number, but a not-too-hacky workaround should be easy to implement.
   3). Generalize this approach to tests relying on the "number of expected failures".

> UnalignedCheckpointITCase failed
> --------------------------------
>
>                 Key: FLINK-22420
>                 URL: https://issues.apache.org/jira/browse/FLINK-22420
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0
>            Reporter: Guowei Ma
>            Priority: Minor
>              Labels: auto-deprioritized-major, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17052&view=logs&j=34f41360-6c0d-54d3-11a1-0292a2def1d9&t=2d56e022-1ace-542f-bf1a-b37dd63243f2&l=9442
> {code:java}
> Apr 22 14:28:21     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> Apr 22 14:28:21     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> Apr 22 14:28:21     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> Apr 22 14:28:21     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> Apr 22 14:28:21     at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> Apr 22 14:28:21     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> Apr 22 14:28:21     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> Apr 22 14:28:21     at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> Apr 22 14:28:21     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> Apr 22 14:28:21     at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> Apr 22 14:28:21     at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> Apr 22 14:28:21     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> Apr 22 14:28:21     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> Apr 22 14:28:21     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> Apr 22 14:28:21     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Apr 22 14:28:21 Caused by: org.apache.flink.util.FlinkException: An OperatorEvent from an OperatorCoordinator to a task was lost. Triggering task failover to ensure consistency. Event: '[NoMoreSplitEvent]', targetTask: Source: source (1/1) - execution #5
> Apr 22 14:28:21     ... 26 more
> Apr 22 14:28:21
> {code}
> As described in the comment
> https://issues.apache.org/jira/browse/FLINK-21996?focusedCommentId=17326449&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17326449
> we might need to adjust the tests to allow failover.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
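
To illustrate steps 1) and 2) of the comment above, here is a rough sketch of how a dedicated exception type could be combined with a restart budget that grows by one for each such failure. Everything here is hypothetical and written only for illustration: `LostOperatorEventException` and `RestartBudget` are made-up names, not existing Flink classes, and the sketch deliberately avoids Flink's actual restart-strategy API.

```java
// Hypothetical marker type (assumption): today this failure surfaces as a plain FlinkException.
class LostOperatorEventException extends Exception {
    LostOperatorEventException(String message) {
        super(message);
    }
}

// Hypothetical test helper (assumption, not a Flink class): tracks how many
// restarts a test still tolerates.
class RestartBudget {
    private int allowedRestarts;
    private int observedRestarts;

    RestartBudget(int allowedRestarts) {
        this.allowedRestarts = allowedRestarts;
    }

    // Records one failure. Returns true if the job may restart,
    // false once the budget is exhausted.
    boolean onFailure(Throwable cause) {
        if (isLostOperatorEvent(cause)) {
            // Step 2): a lost-event failover does not consume the test's budget.
            allowedRestarts++;
        }
        observedRestarts++;
        return observedRestarts <= allowedRestarts;
    }

    int remainingRestarts() {
        return allowedRestarts - observedRestarts;
    }

    // Walks the cause chain, since the marker exception may be wrapped.
    private static boolean isLostOperatorEvent(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof LostOperatorEventException) {
                return true;
            }
        }
        return false;
    }
}
```

With a budget of one expected failure, a lost-event failover followed by the expected failure would still be accepted, while any further failure would be rejected. That keeps the "number of expected failures" check strict for every other kind of failure, which is the property the maxNumberRestartAttempts constraint is meant to protect.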