[ 
https://issues.apache.org/jira/browse/FLINK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-32583:
-----------------------------------
    Labels: pull-request-available  (was: )

> RestClient can deadlock if request made after Netty event executor terminated
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-32583
>                 URL: https://issues.apache.org/jira/browse/FLINK-32583
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / REST
>    Affects Versions: 1.18.0, 1.16.3, 1.17.2
>            Reporter: Patrick Lucas
>            Priority: Minor
>              Labels: pull-request-available
>
> The RestClient can deadlock if a request is made after the Netty event 
> executor has terminated.
> This is due to the listener that would resolve the CompletableFuture that is 
> attached to the ChannelFuture returned by the call to Netty to connect not 
> being able to run because the executor to run it rejects the execution.
> [RestClient.java|https://github.com/apache/flink/blob/release-1.17.1/flink-runtime/src/main/java/org/apache/flink/runtime/rest/RestClient.java#L471-L482]:
> {code:java}
> final ChannelFuture connectFuture = bootstrap.connect(targetAddress, 
> targetPort);
> final CompletableFuture<Channel> channelFuture = new CompletableFuture<>();
> connectFuture.addListener(
>         (ChannelFuture future) -> {
>             if (future.isSuccess()) {
>                 channelFuture.complete(future.channel());
>             } else {
>                 channelFuture.completeExceptionally(future.cause());
>             }
>         });
> {code}
> In this code, the call to {{addListener()}} can fail silently (only logging 
> to the console), meaning any code waiting on the CompletableFuture returned 
> by this method will deadlock.
> There was some work in Netty around this back in 2015, but it's unclear to me 
> how this situation is expected to be handled given the discussion and changes 
> from these issues:
>  * [https://github.com/netty/netty/issues/3449] (open)
>  * [https://github.com/netty/netty/pull/3483] (closed)
>  * [https://github.com/netty/netty/pull/3566] (closed)
>  * [https://github.com/netty/netty/pull/5087] (merged)
> I think a reasonable fix for Flink would be to check the state of 
> {{connectFuture}} and {{channelFuture}} immediately after the call to 
> {{addListener()}}, resolving {{channelFuture}} with 
> {{completeExceptionally()}} if {{connectFuture}} is done and failed and 
> {{channelFuture}} has not been completed. In the possible race condition 
> where the listener was attached successfully and the connection fails 
> instantly, the result is the same, as calls to 
> {{CompletableFuture#completeExceptionally()}} are idempotent.
> A workaround for users of RestClient is to call {{CompletableFuture#get(long 
> timeout, TimeUnit unit)}} rather than {{#get()}} or {{#join()}} on the 
> CompletableFutures it returns. However, if the call throws TimeoutException, 
> the cause of the failure cannot easily be determined.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to