XComp commented on pull request #14798: URL: https://github.com/apache/flink/pull/14798#issuecomment-779102405
I looked through the code (supported by @AHeise): the race condition only kicks in when cancelling or failing the task, because that is when the `failureCause` becomes relevant. So, instead of synchronizing the [state transition](https://github.com/XComp/flink/blob/5781449f38c1e36c1a2952518f9e30761d915f04/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1052), we could add `synchronized` blocks for the [cancellation of a task](https://github.com/XComp/flink/blob/5781449f38c1e36c1a2952518f9e30761d915f04/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1130) and for the [failure handling during the normal Invokable execution](https://github.com/XComp/flink/blob/5781449f38c1e36c1a2952518f9e30761d915f04/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L828). The synchronization would only cover the state transition and setting the `failureCause`. Cancelling the corresponding task would be moved out of the synchronized block.
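A minimal sketch of the idea, not the actual `Task.java` code: the class `TaskSketch`, its `lock` object, the reduced `State` enum, and all method names are hypothetical and exist only for illustration. The state transition and the `failureCause` assignment share one `synchronized` block, while the (potentially slow) cancellation runs after the lock has been released.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the task class; all names are illustrative assumptions. */
class TaskSketch {
    enum State { RUNNING, CANCELING, FAILED }

    private final Object lock = new Object();
    private State state = State.RUNNING;
    private Throwable failureCause;

    /** Counts how often the cancellation logic ran (only for demonstration). */
    final AtomicInteger cancellations = new AtomicInteger();

    /** External cancel/fail request: only the transition and the cause are guarded. */
    void cancelOrFail(State target, Throwable cause) {
        boolean transitioned;
        synchronized (lock) {
            transitioned = (state == State.RUNNING);
            if (transitioned) {
                state = target;
                failureCause = cause;
            }
        }
        // The actual cancellation happens outside the synchronized block.
        if (transitioned) {
            cancel();
        }
    }

    /** Failure raised during normal execution: same guarded section. */
    void handleExecutionFailure(Throwable t) {
        synchronized (lock) {
            if (state == State.RUNNING) {
                state = State.FAILED;
                failureCause = t;
            }
        }
    }

    private void cancel() {
        cancellations.incrementAndGet();
    }

    State getState() {
        synchronized (lock) {
            return state;
        }
    }

    Throwable getFailureCause() {
        synchronized (lock) {
            return failureCause;
        }
    }
}
```

With this layout, whichever path wins the transition also sets the `failureCause`, so a later failure cannot overwrite the cause recorded by an earlier cancel/fail request, and the lock is never held while the cancellation runs.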
