[
https://issues.apache.org/jira/browse/FLINK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742435#comment-17742435
]
Patrick Lucas commented on FLINK-32583:
---------------------------------------
Further investigation suggests that this can only happen when a request is made
after the RestClient itself is closed, which asynchronously shuts down the
event loop used by Netty.
Many uses of RestClient will use try-with-resources where this wouldn't be an
issue, but it should still have the correct behavior when used in other
contexts where a request may be attempted after shutdown has already started.
> RestClient can deadlock if request made after Netty event executor terminated
> -----------------------------------------------------------------------------
>
> Key: FLINK-32583
> URL: https://issues.apache.org/jira/browse/FLINK-32583
> Project: Flink
> Issue Type: Bug
> Components: Runtime / REST
> Affects Versions: 1.18.0, 1.16.3, 1.17.2
> Reporter: Patrick Lucas
> Priority: Minor
> Labels: pull-request-available
>
> The RestClient can deadlock if a request is made after the Netty event
> executor has terminated.
> This is due to the listener that would resolve the CompletableFuture that is
> attached to the ChannelFuture returned by the call to Netty to connect not
> being able to run because the executor to run it rejects the execution.
> [RestClient.java|https://github.com/apache/flink/blob/release-1.17.1/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L471-L482]:
> {code:java}
> final ChannelFuture connectFuture = bootstrap.connect(targetAddress,
> targetPort);
> final CompletableFuture<Channel> channelFuture = new CompletableFuture<>();
> connectFuture.addListener(
> (ChannelFuture future) -> {
> if (future.isSuccess()) {
> channelFuture.complete(future.channel());
> } else {
> channelFuture.completeExceptionally(future.cause());
> }
> });
> {code}
> In this code, the call to {{addListener()}} can fail silently (only logging
> to the console), meaning any code waiting on the CompletableFuture returned
> by this method will deadlock.
> There was some work in Netty around this back in 2015, but it's unclear to me
> how this situation is expected to be handled given the discussion and changes
> from these issues:
> * [https://github.com/netty/netty/issues/3449] (open)
> * [https://github.com/netty/netty/pull/3483] (closed)
> * [https://github.com/netty/netty/pull/3566] (closed)
> * [https://github.com/netty/netty/pull/5087] (merged)
> I think a reasonable fix for Flink would be to check the state of
> {{connectFuture}} and {{channelFuture}} immediately after the call to
> {{addListener()}}, resolving {{channelFuture}} with
> {{completeExceptionally()}} if {{connectFuture}} is done and failed and
> {{channelFuture}} has not been completed. In the possible race condition
> where the listener was attached successfully and the connection fails
> instantly, the result is the same, as calls to
> {{CompletableFuture#completeExceptionally()}} are idempotent.
> A workaround for users of RestClient is to call {{CompletableFuture#get(long
> timeout, TimeUnit unit)}} rather than {{#get()}} or {{#join()}} on the
> CompletableFutures it returns. However, if the call throws TimeoutException,
> the cause of the failure cannot easily be determined.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)