[
https://issues.apache.org/jira/browse/FLINK-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17008965#comment-17008965
]
Andrey Zagrebin commented on FLINK-14163:
-----------------------------------------
Thanks for the analysis, [~zhuzh]. Indeed, whether we have triggered the
registration of partitions or not is a separate state. If a task is canceled
while it is only in the scheduled state, we ignore its partitions, because we
handle partitions only for deployed tasks (the next state).
I agree that this is not an immediate problem, but we have to be honest with
the API and at least use it consistently as a synchronous API. The quickest fix
could be to just call get on the partitions future immediately and document
that it is not used asynchronously at the moment. Later, if we see that this is
a problem for some implementations, we can fix it to correctly handle this
asynchronous state of the task/partitions lifecycle.
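A minimal sketch of what "use it consistently as a synchronous API" could look
like, using only java.util.concurrent. The class and method names
(SyncRegistrationSketch, getImmediately) are hypothetical and not taken from
the Flink code base. Failing fast on a pending future instead of blocking on
get is an assumption made here to keep the synchronous contract explicit.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for the caller of the registration future; not Flink code. */
final class SyncRegistrationSketch {

    private SyncRegistrationSketch() {}

    /**
     * Returns the registration result if the future has already completed.
     * A pending future is rejected, which documents that asynchronous
     * registration is not supported by this caller.
     */
    static <T> T getImmediately(CompletableFuture<T> registration) {
        if (!registration.isDone()) {
            throw new IllegalStateException(
                "Partition registration did not complete synchronously; "
                    + "asynchronous registration is not supported at the moment.");
        }
        // join() does not block here because the future is already done.
        return registration.join();
    }
}
```

With a shuffle master that returns a completed future, getImmediately simply
returns the value. With one that returns a pending future, the caller gets a
clear error instead of a silently unassigned field.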
Wdyt?
cc [~chesnay]
> Execution#producedPartitions is possibly not assigned when used
> ---------------------------------------------------------------
>
> Key: FLINK-14163
> URL: https://issues.apache.org/jira/browse/FLINK-14163
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.9.0, 1.10.0
> Reporter: Zhu Zhu
> Priority: Major
> Fix For: 1.10.0
>
>
> Currently, {{Execution#producedPartitions}} is assigned only after the
> partitions have completed registration with the shuffle master in
> {{Execution#registerProducedPartitions(...)}}.
> The partition registration is an async interface
> ({{ShuffleMaster#registerPartitionWithProducer(...)}}), so
> {{Execution#producedPartitions}} is possibly[1] not set when used.
> Usages include:
> 1. deploying this task, so that the task may be deployed without its result
> partitions assigned, and the job would hang. (DefaultScheduler issue only,
> since the legacy scheduler handled this case)
> 2. generating input descriptors for downstream tasks:
> 3. retrieving {{ResultPartitionID}} for partition releasing:
> [1] If a user uses the default Flink shuffle master {{NettyShuffleMaster}}, it
> is not problematic at the moment, since it returns a completed future on
> registration, so registration is effectively a synchronous process. However,
> if users implement their own shuffle service in which
> {{ShuffleMaster#registerPartitionWithProducer}} returns a pending future, it
> can be a problem. This is possible because the customizable shuffle service
> has been open to users since 1.9 (via config "shuffle-service-factory.class").
> To avoid these issues, we may either
> 1. fix all the usages of {{Execution#producedPartitions}} with regard to the
> async assignment, or
> 2. change {{ShuffleMaster#registerPartitionWithProducer(...)}} to a sync
> interface
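
The race described in the issue can be illustrated with a plain
CompletableFuture. This is only a sketch: AsyncAssignmentSketch and its members
are hypothetical names, and a String stands in for the real partition
descriptors. The field is assigned in a callback, so it stays null for as long
as the registration future is pending.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical model of a field that is assigned from a future callback; not Flink code. */
final class AsyncAssignmentSketch {

    // Remains null until the registration future completes.
    private String producedPartitions;

    /** Assigns the field once the (possibly pending) registration completes. */
    void register(CompletableFuture<String> registration) {
        registration.thenAccept(partitions -> this.producedPartitions = partitions);
    }

    /** May return null if called before the registration has completed. */
    String getProducedPartitions() {
        return producedPartitions;
    }
}
```

If register receives an already completed future, the callback runs right away
and the field is set before register returns. If it receives a pending future,
any read before completion sees null, which corresponds to the usages listed in
the issue.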