[ https://issues.apache.org/jira/browse/FLINK-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011892#comment-17011892 ]
Chesnay Schepler edited comment on FLINK-14163 at 1/9/20 2:40 PM:
------------------------------------------------------------------
Fair enough. If the task is cancelled or fails before the registration
completes, then yes, we may be leaking partitions in general.
To address this I would amend the lambda function in
{{Execution#registerProducedPartitions}} that starts tracking partitions: after
it starts tracking, it should check the state of the execution and, if the
execution is not in a scheduled state, immediately untrack the partitions again.
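
A rough sketch of what I mean, using only {{CompletableFuture}}. None of these
types are the real Flink classes: {{Tracker}}, {{State}} and this cut-down
{{Execution}} are hypothetical stand-ins. The sketch also assumes the callback
and the state change run on the same thread, so the state check itself is not
racy.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for the race discussed above; none of these types are Flink classes.
public class PartitionTrackingSketch {

    // Hypothetical subset of execution states, for illustration only.
    enum State { SCHEDULED, CANCELED }

    // Hypothetical tracker that only remembers partition ids.
    static class Tracker {
        private final Set<String> tracked = ConcurrentHashMap.newKeySet();

        void startTracking(String partitionId) { tracked.add(partitionId); }

        void stopTracking(String partitionId) { tracked.remove(partitionId); }

        boolean isTracking(String partitionId) { return tracked.contains(partitionId); }
    }

    // Hypothetical execution whose state may change while the registration is pending.
    static class Execution {
        volatile State state = State.SCHEDULED;
        private final Tracker tracker;

        Execution(Tracker tracker) { this.tracker = tracker; }

        // The future stands in for the result of an async partition registration.
        CompletableFuture<String> registerProducedPartition(CompletableFuture<String> registration) {
            return registration.thenApply(partitionId -> {
                tracker.startTracking(partitionId);
                // Re-check the state after tracking has started; if the execution
                // was cancelled in the meantime, untrack right away.
                if (state != State.SCHEDULED) {
                    tracker.stopTracking(partitionId);
                }
                return partitionId;
            });
        }
    }

    public static void main(String[] args) {
        Tracker tracker = new Tracker();
        Execution execution = new Execution(tracker);

        // A pending future models a shuffle master that registers asynchronously.
        CompletableFuture<String> pending = new CompletableFuture<>();
        CompletableFuture<String> registered = execution.registerProducedPartition(pending);

        // The execution is cancelled before the registration completes.
        execution.state = State.CANCELED;
        pending.complete("partition-1");
        registered.join();

        System.out.println("still tracked after cancel: " + tracker.isTracking("partition-1"));
    }
}
```

With the state check in place the partition is untracked as soon as the late
registration completes. Without it, the tracker would keep an entry for an
execution that is no longer running.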
> Execution#producedPartitions is possibly not assigned when used
> ---------------------------------------------------------------
>
> Key: FLINK-14163
> URL: https://issues.apache.org/jira/browse/FLINK-14163
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.9.0, 1.10.0
> Reporter: Zhu Zhu
> Assignee: Yuan Mei
> Priority: Major
> Fix For: 1.10.0
>
>
> Currently {{Execution#producedPartitions}} is assigned only after the
> partitions have completed registration with the shuffle master in
> {{Execution#registerProducedPartitions(...)}}.
> The partition registration is an async interface
> ({{ShuffleMaster#registerPartitionWithProducer(...)}}), so
> {{Execution#producedPartitions}} is possibly[1] not set when used.
> Usages include:
> 1. deploying this task, so that the task may be deployed without its result
> partitions assigned, and the job would hang. (DefaultScheduler issue only,
> since the legacy scheduler handled this case)
> 2. generating input descriptors for downstream tasks
> 3. retrieving {{ResultPartitionID}} for partition releasing
> [1] If a user uses Flink's default shuffle master {{NettyShuffleMaster}}, this
> is not a problem at the moment, since it returns a completed future on
> registration, which makes the process synchronous. However, if users
> implement their own shuffle service in which
> {{ShuffleMaster#registerPartitionWithProducer}} returns a pending future, it
> can be a problem. This is possible because the customizable shuffle service has
> been open to users since 1.9 (via config "shuffle-service-factory.class").
> To avoid these issues, we may either
> 1. fix all usages of {{Execution#producedPartitions}} to account for the async
> assignment, or
> 2. change {{ShuffleMaster#registerPartitionWithProducer(...)}} to a sync
> interface
--
This message was sent by Atlassian Jira
(v8.3.4#803005)