[ 
https://issues.apache.org/jira/browse/FLINK-22420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17366530#comment-17366530
 ] 

Yuan Mei edited comment on FLINK-22420 at 6/21/21, 10:42 AM:
-------------------------------------------------------------

Hey [~trohrmann], thanks for replying!

RPC timeout failures rarely happen; as you can see, this test failure occurred 
nearly 2 months ago. 

What I want to point out is that there are sets of tests (like this one, and 
some other race condition tests) that rely on checking for an expected number 
of failures.

My proposal, option 2, is conceptually a general way to "NOT count" certain 
types of failures. We would configure in the test cluster which types of 
exceptions are not counted. Even better, we could use a different 
RestartStrategy.
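
To make the "NOT count" idea concrete, here is a minimal, self-contained sketch. 
The class FilteringFailureCounter and its methods are hypothetical names used 
only for illustration; they are not an existing Flink API and are not wired into 
any RestartStrategy. Treating java.util.concurrent.TimeoutException as the 
ignored type is likewise only an assumption standing in for the RPC timeout.

```java
import java.util.List;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: counts failures, skipping configured exception types. */
final class FilteringFailureCounter {
    // Guard against pathological (cyclic or very deep) cause chains.
    private static final int MAX_CAUSE_DEPTH = 64;

    private final List<Class<? extends Throwable>> ignoredTypes;
    private int countedFailures;

    FilteringFailureCounter(List<Class<? extends Throwable>> ignoredTypes) {
        this.ignoredTypes = List.copyOf(ignoredTypes);
    }

    /** Records a failure; returns true only if it was counted. */
    boolean record(Throwable failure) {
        if (isIgnored(failure)) {
            return false;
        }
        countedFailures++;
        return true;
    }

    /** A failure is ignored if it, or any of its causes, has an ignored type. */
    private boolean isIgnored(Throwable failure) {
        Throwable current = failure;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            for (Class<? extends Throwable> type : ignoredTypes) {
                if (type.isInstance(current)) {
                    return true;
                }
            }
            current = current.getCause();
        }
        return false;
    }

    int getCountedFailures() {
        return countedFailures;
    }
}
```

A test that asserts on an expected number of failures would then compare 
against getCountedFailures(), so an occasional RPC timeout would not change the 
expected count.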

The question is whether this is worthwhile or necessary in the long term, 
because overall the changes increase system complexity. That's why I asked.

But I agree that we can start by increasing the "RPC timeout".







> UnalignedCheckpointITCase failed
> --------------------------------
>
>                 Key: FLINK-22420
>                 URL: https://issues.apache.org/jira/browse/FLINK-22420
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0
>            Reporter: Guowei Ma
>            Priority: Minor
>              Labels: auto-deprioritized-major, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=17052&view=logs&j=34f41360-6c0d-54d3-11a1-0292a2def1d9&t=2d56e022-1ace-542f-bf1a-b37dd63243f2&l=9442
> {code:java}
> Apr 22 14:28:21       at 
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> Apr 22 14:28:21       at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> Apr 22 14:28:21       at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> Apr 22 14:28:21       at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> Apr 22 14:28:21       at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> Apr 22 14:28:21       at 
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> Apr 22 14:28:21       at 
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> Apr 22 14:28:21       at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> Apr 22 14:28:21       at 
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> Apr 22 14:28:21       at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> Apr 22 14:28:21       at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> Apr 22 14:28:21       at 
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> Apr 22 14:28:21       at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> Apr 22 14:28:21       at 
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> Apr 22 14:28:21       at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Apr 22 14:28:21 Caused by: org.apache.flink.util.FlinkException: An 
> OperatorEvent from an OperatorCoordinator to a task was lost. Triggering task 
> failover to ensure consistency. Event: '[NoMoreSplitEvent]', targetTask: 
> Source: source (1/1) - execution #5
> Apr 22 14:28:21       ... 26 more
> Apr 22 14:28:21 
> {code}
> As described in the comment 
> https://issues.apache.org/jira/browse/FLINK-21996?focusedCommentId=17326449&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17326449
> we might need to adjust the tests to allow failover.



