[ 
https://issues.apache.org/jira/browse/FLINK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi resolved FLINK-1604.
--------------------------------
    Resolution: Fixed

Fixed in 859a839 and 2f1987a.

> Livelock in PartitionRequestClientFactory
> -----------------------------------------
>
>                 Key: FLINK-1604
>                 URL: https://issues.apache.org/jira/browse/FLINK-1604
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>
> In case of a job restart, we observed a livelock in 
> {{PartitionRequestClientFactory.createPartitionRequestClient}}. We suspect 
> that this might have the following reason:
> In order to obtain a new {{PartitionRequestClient}} a new 
> {{ConnectingChannel}} is created. This channel acts as a future for the 
> client. The channel is inserted into a {{ConcurrentHashMap}} so that other 
> {{Threads}} trying to create a client for the same address wait on the 
> future. Once the client is obtained by the initially requesting {{Thread}}, 
> it replaces the {{ConnectingChannel}} in the {{ConcurrentHashMap}}. 
> When the client is disposed, it is removed from the map, but only if the 
> client itself is still stored there. 
> And here is where things can go wrong. If the requesting thread is 
> interrupted after it created the {{ConnectingChannel}} and inserted it into 
> the {{ConcurrentHashMap}} but before inserting the {{PartitionRequestClient}} 
> into the same map, then the map entry for the given {{RemoteAddress}} remains 
> the {{ConnectingChannel}}. Assume now that another thread waited on this channel 
> and eventually obtained the client from this future. In the wake of 
> cancelling the job, the client would be disposed by the corresponding 
> {{RemoteInputChannel}}. Once the job has been restarted, new threads want to 
> connect to the {{RemoteAddress}} and they find the {{ConnectingChannel}} with 
> the disposed {{PartitionRequestClient}} as future result in the hash map. 
> They retrieve the channel and see that the client has already been disposed. 
> Now they try to delete the client from the {{ConcurrentHashMap}} to make room 
> for a new one. However, this deletion fails, because the map still contains 
> the {{ConnectingChannel}} rather than the client, so the threads would keep 
> retrying indefinitely.
> To make a long story short, we believe that cancelling a job can leave the 
> network state invalid.
> That is currently our best theory for the livelock we observed on Travis.
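
The suspected sequence can be sketched in isolation. The following is a minimal, single-threaded illustration under stated assumptions: {{LivelockSketch}}, {{Client}}, the simplified {{ConnectingChannel}}, and all method names are hypothetical stand-ins and not Flink's actual code. The retry loop is bounded so that the sketch terminates; in the reported scenario the retries would presumably continue without bound.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical, simplified stand-ins for illustration; not Flink's actual classes.
final class LivelockSketch {

    /** Stand-in for PartitionRequestClient. */
    static final class Client {
        volatile boolean disposed;
    }

    /** Stand-in for ConnectingChannel: a future that hands out the client. */
    static final class ConnectingChannel {
        final Client client;

        ConnectingChannel(Client client) {
            this.client = client;
        }
    }

    /** Disposal as described in the report: removes the entry only if it is the client itself. */
    static boolean dispose(ConcurrentMap<String, Object> map, String address, Client client) {
        client.disposed = true;
        return map.remove(address, client);
    }

    /** Counts lookups that hit a disposed client (bounded so the sketch terminates). */
    static int failedLookups(ConcurrentMap<String, Object> map, String address, int maxAttempts) {
        int failed = 0;
        for (int i = 0; i < maxAttempts; i++) {
            Object entry = map.get(address);
            if (entry == null) {
                break; // slot is free: a new connection could be created here
            }
            Client client = entry instanceof ConnectingChannel
                    ? ((ConnectingChannel) entry).client
                    : (Client) entry;
            if (!client.disposed) {
                break; // usable client found
            }
            map.remove(address, client); // has no effect while the entry is the channel
            failed++;
        }
        return failed;
    }

    /** Requesting thread was interrupted: the channel was never replaced by the client. */
    static int retriesWithStaleChannel(int maxAttempts) {
        ConcurrentMap<String, Object> map = new ConcurrentHashMap<>();
        Client client = new Client();
        map.put("remote", new ConnectingChannel(client));
        dispose(map, "remote", client); // returns false, the channel stays in the map
        return failedLookups(map, "remote", maxAttempts);
    }

    /** Normal path: the client replaced the channel before it was disposed. */
    static int retriesWithStoredClient(int maxAttempts) {
        ConcurrentMap<String, Object> map = new ConcurrentHashMap<>();
        Client client = new Client();
        map.put("remote", new ConnectingChannel(client));
        map.put("remote", client); // client takes the channel's place
        dispose(map, "remote", client); // returns true, the entry is removed
        return failedLookups(map, "remote", maxAttempts);
    }

    public static void main(String[] args) {
        System.out.println("stale channel: " + retriesWithStaleChannel(5));
        System.out.println("stored client: " + retriesWithStoredClient(5));
    }
}
```

With the stale channel in the map, every bounded attempt fails, because the two-argument {{remove}} only deletes an entry whose value equals the given client. On the normal path the disposed client is removed and a later lookup finds a free slot.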



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
