[ 
https://issues.apache.org/jira/browse/FLINK-22420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366471#comment-17366471
 ] 

Till Rohrmann commented on FLINK-22420:
---------------------------------------

Which solution approach do you mean by option 2, [~ym]?

I am not sure whether we should really do the improvements under point 2. 
In particular, the logic to bump {{maxNumberRestartAttempts}} when a special 
{{Exception}} occurs does not feel right to me. If this is really required, 
it might be better to introduce a special {{RestartStrategy}} for testing 
purposes that ignores certain exceptions.
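
To make the idea concrete, here is a minimal sketch of what such a test-only 
strategy could look like. It is plain Java and deliberately does not implement 
any Flink interface: the class {{IgnoringRestartStrategy}} and its method 
names are hypothetical and only illustrate the counting logic. Failures whose 
cause chain contains an ignored type do not consume a restart attempt, so the 
configured limit itself never has to be bumped.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical test-only restart strategy (NOT a Flink API).
 * Failures of an ignored type do not count against the restart limit.
 */
final class IgnoringRestartStrategy {
    private final int maxNumberRestartAttempts;
    private final Set<Class<? extends Throwable>> ignoredFailureTypes;
    private int countedAttempts;

    IgnoringRestartStrategy(
            int maxNumberRestartAttempts,
            Collection<? extends Class<? extends Throwable>> ignoredFailureTypes) {
        this.maxNumberRestartAttempts = maxNumberRestartAttempts;
        this.ignoredFailureTypes = new HashSet<>(ignoredFailureTypes);
    }

    /** Records the failure and returns true if a restart is still allowed. */
    boolean notifyFailureAndCanRestart(Throwable failure) {
        if (!isIgnored(failure)) {
            countedAttempts++;
        }
        return countedAttempts <= maxNumberRestartAttempts;
    }

    /** Number of failures that were counted against the limit so far. */
    int getCountedAttempts() {
        return countedAttempts;
    }

    /** A failure is ignored if it, or any of its causes, has an ignored type. */
    private boolean isIgnored(Throwable failure) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            for (Class<? extends Throwable> type : ignoredFailureTypes) {
                if (type.isInstance(t)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A test would list the infrastructure-related exception types as ignored and 
keep the restart limit equal to the number of failures it injects on purpose.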

For me, the first question is why we see these failures. Do we have 
processing gaps of more than 10s on our test infrastructure? If the 
infrastructure is overloaded, then an easy but imperfect fix could be to 
increase the RPC timeout.

In general, if we have tests that rely on a certain order and number of 
exceptions, then they assume that no other problems occur (which requires a 
stable cluster). So either we provide a stable test setup, or we relax this 
assumption, for example by not relying on the exact number.
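
As a rough illustration of the second alternative, a test could check for a 
minimum number of expected failures instead of an exact count, so that 
additional failovers caused by the infrastructure do not break it. The helper 
below is a hypothetical sketch in plain Java, not existing Flink test code.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical test helper (NOT existing Flink test code). */
final class FailureExpectations {
    private FailureExpectations() {}

    /**
     * Returns true if at least minExpected of the observed failures match
     * isExpected. Other failures in the list are tolerated, not rejected.
     */
    static boolean sawAtLeast(
            List<? extends Throwable> observed,
            Predicate<Throwable> isExpected,
            int minExpected) {
        long matched = observed.stream().filter(isExpected).count();
        return matched >= minExpected;
    }
}
```

The trade-off is that such a check no longer detects unexpected extra 
failures, which is why a stable test setup remains the stricter option.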

> UnalignedCheckpointITCase failed
> --------------------------------
>
>                 Key: FLINK-22420
>                 URL: https://issues.apache.org/jira/browse/FLINK-22420
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0
>            Reporter: Guowei Ma
>            Priority: Minor
>              Labels: auto-deprioritized-major, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17052&view=logs&j=34f41360-6c0d-54d3-11a1-0292a2def1d9&t=2d56e022-1ace-542f-bf1a-b37dd63243f2&l=9442
> {code:java}
> Apr 22 14:28:21       at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> Apr 22 14:28:21       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> Apr 22 14:28:21       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> Apr 22 14:28:21       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> Apr 22 14:28:21       at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> Apr 22 14:28:21       at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> Apr 22 14:28:21       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> Apr 22 14:28:21       at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> Apr 22 14:28:21       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> Apr 22 14:28:21       at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> Apr 22 14:28:21       at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> Apr 22 14:28:21       at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> Apr 22 14:28:21       at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> Apr 22 14:28:21       at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> Apr 22 14:28:21       at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Apr 22 14:28:21 Caused by: org.apache.flink.util.FlinkException: An OperatorEvent from an OperatorCoordinator to a task was lost. Triggering task failover to ensure consistency. Event: '[NoMoreSplitEvent]', targetTask: Source: source (1/1) - execution #5
> Apr 22 14:28:21       ... 26 more
> Apr 22 14:28:21 
> {code}
> As described in the comment 
> https://issues.apache.org/jira/browse/FLINK-21996?focusedCommentId=17326449&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17326449
> we might need to adjust the tests to allow failover.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
