[ https://issues.apache.org/jira/browse/FLINK-14607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16967612#comment-16967612 ]
Till Rohrmann commented on FLINK-14607:
---------------------------------------
Yes, exactly. Ideally, the {{ExecutionSlotAllocator}} receives all executions
belonging to a single pipelined region and then makes sure that either all or
none get assigned to a slot. Moreover, we only allow slot sharing between tasks
belonging to the same pipelined region. That way we can statically compute the
slot sharing at scheduling time and don't need to touch it later.
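
A minimal Java sketch of the all-or-nothing idea described above, under stated
assumptions. The class RegionSlotAllocator, its methods, and the string ids are
hypothetical and are not part of Flink's ExecutionSlotAllocator. The caller is
assumed to have already grouped the executions of one pipelined region into
"sharing units", where each unit is a list of executions that share one slot.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration only; this is not Flink's ExecutionSlotAllocator. */
final class RegionSlotAllocator {
    private final Deque<String> freeSlots = new ArrayDeque<>();

    RegionSlotAllocator(List<String> slotIds) {
        freeSlots.addAll(slotIds);
    }

    /**
     * Each inner list holds the executions of one pipelined region that are
     * meant to share a single slot (computed once, before scheduling).
     * Returns execution -> slot for the whole region, or empty if the region
     * does not fit. In the empty case no slot is taken.
     */
    Optional<Map<String, String>> allocateRegion(List<List<String>> sharingUnits) {
        if (sharingUnits.size() > freeSlots.size()) {
            return Optional.empty(); // none of the executions gets a slot
        }
        Map<String, String> assignment = new LinkedHashMap<>();
        for (List<String> unit : sharingUnits) {
            String slot = freeSlots.pop(); // one slot per sharing unit
            for (String execution : unit) {
                assignment.put(execution, slot);
            }
        }
        return Optional.of(assignment);
    }

    int freeSlotCount() {
        return freeSlots.size();
    }
}
```

With the region {A1, B1}, {A2, B2} and a single free slot, allocateRegion
returns empty and leaves the slot untouched, so no partially scheduled region
can hold on to resources while waiting for the rest.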
> SharedSlot cannot fulfill pending slot requests before it's completely
> released
> -------------------------------------------------------------------------------
>
> Key: FLINK-14607
> URL: https://issues.apache.org/jira/browse/FLINK-14607
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.10.0, 1.9.1
> Reporter: Zhu Zhu
> Priority: Major
>
> Currently a pending request can only be fulfilled when a physical
> slot ({{AllocatedSlot}}) becomes available in {{SlotPool}}.
> A shared slot, however, cannot be used to fulfill pending requests even if it
> becomes qualified. This may lead to resource deadlocks in certain cases.
> For example, when running job A(parallelism=2) --(pipelined)--> B(parallelism=2)
> with only 1 slot and all vertices in the same slot sharing group, here's
> what may happen:
> 1. Schedule A1 and A2. A1 acquires the only slot, A2's slot request is
> pending because a slot cannot host 2 instances of the same JobVertex at the
> same time. Shared slot status: \{A1\}
> 2. A1 produces data and triggers the scheduling of B1. Shared slot status:
> \{A1, B1\}
> 3. A1 finishes. Shared slot status: \{B1\}
> 4. B1 cannot finish since A2 has not finished, while A2 cannot get launched
> because no physical slot becomes available, even though the shared slot is
> qualified to host it now. A resource deadlock happens.
> Maybe we should improve {{SlotSharingManager}}. Once a task slot is released,
> its root {{MultiTaskSlot}} should be used to try fulfilling existing pending
> task slots from other pending root slots ({{unresolvedRootSlots}}) in this
> {{SlotSharingManager}} (that is, in the same slot sharing group).
> We need to be careful not to cause any failures or violate
> colocation constraints.
> cc [~trohrmann]
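
The four-step scenario in the quoted description can be replayed with a toy
model of a single shared slot. This is only a sketch: SharedSlotSim and its
methods are hypothetical names, not Flink classes, and the model ignores
colocation constraints and failure handling. The flag retryPendingOnRelease
stands in for the proposed behavior of re-offering the shared slot to pending
requests whenever a task slot is released.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of one shared slot; all names are hypothetical, not Flink classes. */
final class SharedSlotSim {
    // JobVertex name -> subtask currently hosted, e.g. "A" -> "A1"
    private final Map<String, String> hosted = new HashMap<>();
    // Pending requests as {vertex, subtask} pairs
    private final Queue<String[]> pending = new ArrayDeque<>();
    private final boolean retryPendingOnRelease;

    SharedSlotSim(boolean retryPendingOnRelease) {
        this.retryPendingOnRelease = retryPendingOnRelease;
    }

    /** Places the subtask unless the slot already hosts the same vertex; otherwise queues it. */
    boolean request(String vertex, String subtask) {
        if (hosted.containsKey(vertex)) {
            pending.add(new String[] {vertex, subtask});
            return false;
        }
        hosted.put(vertex, subtask);
        return true;
    }

    /** Releases the finished subtask of a vertex and optionally retries pending requests. */
    void release(String vertex) {
        hosted.remove(vertex);
        if (!retryPendingOnRelease) {
            return; // current behavior: only a new physical slot would help
        }
        int n = pending.size();
        for (int i = 0; i < n; i++) {
            String[] req = pending.poll();
            request(req[0], req[1]); // goes back to the queue if still not placeable
        }
    }

    boolean isRunning(String subtask) {
        return hosted.containsValue(subtask);
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Replaying request("A", "A1"), request("A", "A2"), request("B", "B1"),
release("A") with the flag set to false leaves A2 pending forever, which is
the deadlock. With the flag set to true, A2 is placed into the shared slot as
soon as A1 is released.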
--
This message was sent by Atlassian Jira
(v8.3.4#803005)