XComp commented on pull request #15898: URL: https://github.com/apache/flink/pull/15898#issuecomment-891799420
> Hi @zentol I just got back from a vacation so I'm looking at this again. I'm not sure what we want specification wise with concurrent failure support. I can imagine all sorts of things failing concurrently. It seems like maybe concurrent failures would be best tested in a more elaborate integration test to have a very clear expectation of correct behavior? Perhaps that could be a follow up ticket to make this current minimal exception handling landable quickly after I address the current feedback? Hi @bytesandwich , I'm back from vacation so I am able to answer your questions. Testing concurrent failures should be possible as part of the `AdaptiveSchedulerTest`. Similarly to what you've done in [AdaptiveSchedulerTest:929](https://github.com/apache/flink/blob/521e19eecadf39a226c5b5be4ed5348485656eab/flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java#L929) with one `updateTaskExecutionState` call you should be able to do with two calls. The first call will make the `AdaptiveScheduler` switch into restarting state. Calling the `updateTaskExecutionState` again would not catch the second exception right now. Implementing the exception handling also in the `Restarting` state class should solve the issue. Analogously, that has to be done for cases where the scheduler does not switch into `Restarting` state but `Failing`. Does that make sense to you? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
