[ 
https://issues.apache.org/jira/browse/FLINK-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ufuk Celebi resolved FLINK-469.
-------------------------------

       Resolution: Fixed
    Fix Version/s:     (was: pre-apache)

Fixed in 
[2db78a8dc1a4664f3e384005d7e07bea594b835b|https://github.com/apache/incubator-flink/commit/2db78a8dc1a4664f3e384005d7e07bea594b835b].

> LocalDistributedExecutor Deadlock with Low Buffer Count
> -------------------------------------------------------
>
>                 Key: FLINK-469
>                 URL: https://issues.apache.org/jira/browse/FLINK-469
>             Project: Flink
>          Issue Type: Bug
>            Reporter: GitHub Import
>              Labels: github-import
>
> I'm currently working on 
> ([#25|https://github.com/stratosphere/stratosphere/issues/25] | 
> [FLINK-25|https://issues.apache.org/jira/browse/FLINK-25]) and discovered a 
> possible deadlock in the network stack, because of the buffer management in 
> combination with the `LocalDistributedExecutor` (LDE).
> The LDE starts a JobManager and multiple TaskManagers on different network 
> ports in a single VM. Every TaskManager has an associated 
> `ByteBufferedChannelManager` (single instance) and `GlobalBufferPool` 
> (singleton) for data transfers. When tasks get registered with a TaskManager 
> (which is atomic per TaskManager), the ChannelManager ensures that there are 
> enough network buffers available to execute the task -- this means that there 
> has to be at least one buffer per task channel. If this condition does not 
> hold, an exception is thrown and the task fails. This decision is made 
> locally per task and not for the whole plan. For WordCount, for example, it 
> is possible that all map tasks get enough buffers, but a subsequent reduce 
> task throws an exception at runtime.
> The problem occurs in combination with the LDE: we have multiple TMs, each 
> with its own ChannelManager instance, but only a singleton GlobalBufferPool. 
> This breaks the available-buffer computation, because each TM just considers 
> its local channels (registered at the ChannelManager) and not the channels 
> of other TMs (which is perfectly fine in a real distributed setup). 
> Therefore, tasks can deadlock because of missing buffers (buffer requests 
> are blocking).
> You can likely reproduce this problem by running 
> `LocalDistributedExecutorTest` with the number of buffers set to 20 and 
> the buffer size set to 4096 bytes (see `ConfigConstants`; also make sure to 
> set `multicastEnabled` in ByteBufferedChannelManager to `false`, because it 
> influences the computation -- multicast does not work anyway).
> I will fix this with the upcoming PR for 
> ([#25|https://github.com/stratosphere/stratosphere/issues/25] | 
> [FLINK-25|https://issues.apache.org/jira/browse/FLINK-25]).
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/469
> Created by: [uce|https://github.com/uce]
> Labels: bug, runtime, 
> Assignee: [uce|https://github.com/uce]
> Created at: Wed Feb 12 13:58:36 CET 2014
> State: open
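
The accounting mismatch described in the report can be illustrated with a minimal Java sketch. It is not the Flink or Stratosphere code: the class `SharedPoolAccounting`, its methods, and the figure of 12 channels per TaskManager are hypothetical and chosen only for illustration. Only the pool size of 20 buffers comes from the report.

```java
// Hypothetical illustration only; none of these names exist in Flink.
public class SharedPoolAccounting {

    // Per-TaskManager check as described in the report: the task is accepted
    // if there is at least one buffer per locally registered channel.
    static boolean localCheckPasses(int totalBuffers, int localChannels) {
        return localChannels <= totalBuffers;
    }

    // Check that accounts for every TaskManager drawing from the same pool.
    static boolean globalCheckPasses(int totalBuffers, int... channelsPerTaskManager) {
        int sum = 0;
        for (int channels : channelsPerTaskManager) {
            sum += channels;
        }
        return sum <= totalBuffers;
    }

    public static void main(String[] args) {
        int totalBuffers = 20;        // buffer count from the report
        int[] channels = {12, 12};    // assumed channel counts for two TaskManagers

        for (int i = 0; i < channels.length; i++) {
            System.out.println("TM " + i + " local check passes: "
                    + localCheckPasses(totalBuffers, channels[i]));
        }
        System.out.println("Global check passes: "
                + globalCheckPasses(totalBuffers, channels));
    }
}
```

With these assumed numbers, each TaskManager passes its local check (12 <= 20), while the shared pool is oversubscribed (24 > 20). Since buffer requests are blocking, tasks admitted this way can end up waiting for buffers that never become available.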



--
This message was sent by Atlassian JIRA
(v6.2#6252)
