[
https://issues.apache.org/jira/browse/FLINK-1604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ufuk Celebi resolved FLINK-1604.
--------------------------------
Resolution: Fixed
Fixed in 859a839 and 2f1987a.
> Livelock in PartitionRequestClientFactory
> -----------------------------------------
>
> Key: FLINK-1604
> URL: https://issues.apache.org/jira/browse/FLINK-1604
> Project: Flink
> Issue Type: Bug
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
>
> After a job restart, we observed a livelock in
> {{PartitionRequestClientFactory.createPartitionRequestClient}}. We suspect
> the following cause:
> To obtain a new {{PartitionRequestClient}}, a new {{ConnectingChannel}} is
> created. This channel acts as a future for the client. The channel is
> inserted into a {{ConcurrentHashMap}} so that other threads trying to create
> a client for the same address wait on the future. Once the initially
> requesting thread has obtained the client, it inserts the client into the
> {{ConcurrentHashMap}} in place of the {{ConnectingChannel}}. When the client
> is disposed, it is removed from the map, but only if the map still holds the
> client itself.
> And here is where things can go wrong. If the requesting thread is
> interrupted after it created the {{ConnectingChannel}} and inserted it into
> the {{ConcurrentHashMap}}, but before inserting the {{PartitionRequestClient}}
> into the same map, then the map entry for the given {{RemoteAddress}} remains
> the {{ConnectingChannel}}. Assume now that another thread waited at this channel
> and eventually obtained the client from this future. In the wake of
> cancelling the job, the client would be disposed by the corresponding
> {{RemoteInputChannel}}. Once the job has been restarted, new threads want to
> connect to the {{RemoteAddress}} and they find the {{ConnectingChannel}} with
> the disposed {{PartitionRequestClient}} as future result in the hash map.
> They retrieve the channel and see that the client has already been disposed.
> Now they try to delete the client from the {{ConcurrentHashMap}} to make room
> for a new one. However, this deletion fails, because the map still contains
> the {{ConnectingChannel}}.
> To make a long story short, we believe that the network state is left
> invalid after cancelling a job.
> That is currently our best theory for the livelock we observed on Travis.
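
The suspected interleaving can be replayed deterministically on a single thread. The sketch below is not Flink code: {{StaleEntryDemo}}, {{Client}}, {{attemptsAfterRestart}} and the retry bound are hypothetical stand-ins, and the "remove by entry" variant is only one possible remedy, not necessarily what the commits above do. It assumes the map removal is a conditional {{remove(key, value)}} keyed on the client.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class StaleEntryDemo {

    /** Hypothetical stand-in for PartitionRequestClient. */
    static final class Client {
        volatile boolean disposed;
    }

    /** Hypothetical stand-in for ConnectingChannel: a future that yields a client. */
    static final class ConnectingChannel {
        final Client client = new Client();
    }

    /** Bound on retries so the demo terminates; the real symptom would be an endless loop. */
    static final int MAX_ATTEMPTS = 5;

    /**
     * Replays the suspected interleaving and returns how many attempts a
     * post-restart caller needs to obtain a usable client, or -1 if it gives up.
     */
    static int attemptsAfterRestart(boolean removeActualEntry) {
        ConcurrentMap<String, Object> clients = new ConcurrentHashMap<>();
        String address = "remote-address";

        // 1. The requesting thread registers the channel and is interrupted
        //    before it can swap the client in, so the channel stays in the map.
        ConnectingChannel channel = new ConnectingChannel();
        clients.putIfAbsent(address, channel);

        // 2. A waiting thread obtains the client from the channel. The job is
        //    cancelled and the client is disposed.
        Client stale = channel.client;
        stale.disposed = true;
        clients.remove(address, stale); // no effect: the map holds the channel, not the client

        // 3. After the restart, a new thread asks for a client.
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            Object entry = clients.get(address);
            if (entry == null) {
                ConnectingChannel fresh = new ConnectingChannel();
                if (clients.putIfAbsent(address, fresh) == null) {
                    clients.replace(address, fresh, fresh.client);
                    return attempt;
                }
                continue;
            }
            Client client = entry instanceof ConnectingChannel
                    ? ((ConnectingChannel) entry).client
                    : (Client) entry;
            if (!client.disposed) {
                return attempt;
            }
            // Suspected bug: removal is keyed on the client and therefore fails.
            // Possible remedy: remove whatever entry was actually found.
            clients.remove(address, removeActualEntry ? entry : client);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("remove by client: " + attemptsAfterRestart(false));
        System.out.println("remove by entry:  " + attemptsAfterRestart(true));
    }
}
```

With removal keyed on the client, every retry finds the same stale {{ConnectingChannel}} and the caller never makes progress. With removal keyed on the entry that was actually found, the second attempt succeeds.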