pnowojski commented on code in PR #19993:
URL: https://github.com/apache/flink/pull/19993#discussion_r899887457
##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriteRequest.java:
##########
@@ -109,6 +112,9 @@ static ChannelStateWriteRequest buildFutureWriteRequest(
}
},
throwable -> {
+ if (!dataFuture.isDone()) {
+ return;
+ }
Review Comment:
I agree with @zentol that this doesn't look good and I would be afraid it
could lead to some resource leaks.
It looks to me like the issue is that `dataFuture` is being cancelled from
the chain: `PipelinedSubpartition#release()` <- ... <-
`ResultPartition#release` <- ... <- `NettyShuffleEnvironment#close`. That
happens after `StreamTask#cleanUp` (which is waiting for this future to
complete), leading to a deadlock.
We would either need to cancel the future sooner (`StreamTask#cleanUp`?),
or do what @zentol proposed. I think the latter is indeed a good option. We
don't need to block while waiting for the future. Let's just not completely
ignore exceptions here. Logging an error should be fine.
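
To make the suggestion concrete, here is a rough sketch of the non-blocking variant, not the actual patch. `FutureDiscardSketch`, `discardWhenDone`, `recycle` and the in-memory `LOG` list are hypothetical stand-ins (the real code would use the class's own logger and buffer recycling); only the `CompletableFuture` calls are real JDK API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

/** Hypothetical stand-in for illustration; this is not Flink's ChannelStateWriteRequest. */
class FutureDiscardSketch {

    /** Collects messages in place of a real logger so the sketch stays self-contained. */
    static final List<String> LOG = new ArrayList<>();

    /**
     * Cleans up the data once the future completes, without blocking the caller.
     * A callback is registered instead of calling get()/join(), so a future that
     * is only cancelled later (e.g. during shutdown) cannot stall this thread.
     */
    static <T> void discardWhenDone(CompletableFuture<T> dataFuture, Consumer<T> recycle) {
        dataFuture.whenComplete(
                (data, failure) -> {
                    if (failure != null) {
                        // Do not swallow the failure silently: report it.
                        LOG.add("ERROR: failed to clean up channel state data: " + failure);
                    } else {
                        // Normal completion: release the resources held by the data.
                        recycle.accept(data);
                    }
                });
    }
}
```

If the future is cancelled after the callback is registered, the callback still runs (with a `CancellationException`), so the failure is reported and nothing has to wait on the future.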
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]