[
https://issues.apache.org/jira/browse/FLINK-469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ufuk Celebi resolved FLINK-469.
-------------------------------
Resolution: Fixed
Fix Version/s: (was: pre-apache)
Fixed in
[2db78a8dc1a4664f3e384005d7e07bea594b835b|https://github.com/apache/incubator-flink/commit/2db78a8dc1a4664f3e384005d7e07bea594b835b].
> LocalDistributedExecutor Deadlock with Low Buffer Count
> -------------------------------------------------------
>
> Key: FLINK-469
> URL: https://issues.apache.org/jira/browse/FLINK-469
> Project: Flink
> Issue Type: Bug
> Reporter: GitHub Import
> Labels: github-import
>
> I'm currently working on
> ([#25|https://github.com/stratosphere/stratosphere/issues/25] |
> [FLINK-25|https://issues.apache.org/jira/browse/FLINK-25]) and discovered a
> possible deadlock in the network stack, because of the buffer management in
> combination with the `LocalDistributedExecutor` (LDE).
> The LDE starts a JobManager and multiple TaskManagers on different network
> ports in a single VM. Every TaskManager has an associated
> `ByteBufferedChannelManager` (single instance) and `GlobalBufferPool`
> (singleton) for data transfers. When tasks get registered with a TaskManager
> (which is atomic per TaskManager), the ChannelManager ensures that there are
> enough network buffers available to execute the task -- this means that there
> has to be at least one buffer per task channel. If this condition does not
> hold, an exception is thrown and the task fails. This decision is made
> locally per task and not for the whole plan, e.g. for WordCount it is
> possible that all map tasks get enough buffers, but a following reduce throws
> an exception at runtime.
> The problem occurs in combination with the LDE: we have multiple TMs with
> their ChannelManager instances, but only a singleton GlobalBufferPool. This
> results in a problem with the available buffer computation, because each TM
> just considers its local channels (registered at the ChannelManager) and not
> the channels of other TMs (which is perfectly fine in a real distributed
> setup). Therefore, it is possible for tasks to deadlock, because of missing
> buffers (buffer requests are blocking).
> You can likely reproduce this problem by running
> `LocalDistributedExecutorTest` with the number of buffers set to 20 and
> the buffer size set to 4096 bytes (see `ConfigConstants`; also make sure to set
> `multicastEnabled` in ByteBufferedChannelManager to `false`, because it
> influences the computation -- multicast does not work anyway).
> I will fix this with the upcoming PR for
> ([#25|https://github.com/stratosphere/stratosphere/issues/25] |
> [FLINK-25|https://issues.apache.org/jira/browse/FLINK-25]).
> ---------------- Imported from GitHub ----------------
> Url: https://github.com/stratosphere/stratosphere/issues/469
> Created by: [uce|https://github.com/uce]
> Labels: bug, runtime,
> Assignee: [uce|https://github.com/uce]
> Created at: Wed Feb 12 13:58:36 CET 2014
> State: open
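
The accounting problem described in the issue can be illustrated with a small sketch. This is not the actual Flink/Stratosphere code: the class and method names (`BufferAccountingSketch`, `SharedPool`, `LocalChannelManager`, `tryRegisterTask`, `globallySafe`) are hypothetical, and the figures of two TaskManagers with 16 channels each are assumptions chosen for illustration. Only the pool size of 20 buffers comes from the report. Each manager checks just its own channels against the shared pool, so every local check passes even though the combined demand exceeds the pool.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is not Flink's actual buffer management code.
final class BufferAccountingSketch {

    /** Stands in for the singleton pool shared by every TaskManager in one VM. */
    static final class SharedPool {
        final int totalBuffers;
        SharedPool(int totalBuffers) { this.totalBuffers = totalBuffers; }
    }

    /** Stands in for one per-TaskManager channel manager. */
    static final class LocalChannelManager {
        private final SharedPool pool;
        private int localChannels;

        LocalChannelManager(SharedPool pool) { this.pool = pool; }

        /** Local check: only this manager's channels are counted against the pool. */
        boolean tryRegisterTask(int channels) {
            if (localChannels + channels > pool.totalBuffers) {
                return false; // per the issue, the real code throws and the task fails
            }
            localChannels += channels;
            return true;
        }

        int localChannels() { return localChannels; }
    }

    /** Sum of channels over all managers that share the same pool. */
    static int globalChannels(List<LocalChannelManager> managers) {
        int sum = 0;
        for (LocalChannelManager m : managers) {
            sum += m.localChannels();
        }
        return sum;
    }

    /** True if every registered channel could hold one buffer at the same time. */
    static boolean globallySafe(SharedPool pool, List<LocalChannelManager> managers) {
        return globalChannels(managers) <= pool.totalBuffers;
    }

    public static void main(String[] args) {
        SharedPool pool = new SharedPool(20); // 20 buffers, as in the report
        List<LocalChannelManager> managers = new ArrayList<>();
        for (int i = 0; i < 2; i++) {         // assumed: two TaskManagers in one VM
            managers.add(new LocalChannelManager(pool));
        }
        for (LocalChannelManager m : managers) {
            // assumed: 16 channels per TaskManager
            System.out.println("local check passed: " + m.tryRegisterTask(16));
        }
        System.out.println("channels=" + globalChannels(managers)
                + " buffers=" + pool.totalBuffers
                + " globallySafe=" + globallySafe(pool, managers));
    }
}
```

With these assumed numbers both local checks succeed, yet 32 channels compete for 20 buffers. Because buffer requests block, this is the kind of state in which tasks can wait on each other indefinitely. In a real distributed setup each TaskManager has its own pool, so the local check is sufficient there.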
--
This message was sent by Atlassian JIRA
(v6.2#6252)