[ 
https://issues.apache.org/jira/browse/FLINK-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014147#comment-17014147
 ] 

Yuan Mei commented on FLINK-14163:
----------------------------------

 

Synced up with [~zjwang] offline. 

 

In general, I would prefer not to change interfaces unless having to, since 
interface changes have more profound impact on users than implementation 
changes. Especially in this case, given the future design is not determined 
(both Shuffle and Scheduler), changing async -> sycn and then sync -> async 
again bring much more pain to users than keeping the asycn interfaces and 
enforces a sync implementation underneath. 

In the latter case, a user barely needs to do anything if we do want to support 
async implementation in the future, while in the first case, every single user 
that uses the interface has to make a new release to adjust to the updated 
interface.

 

I do not have strong preferences other than the interface. I think retain a 
timeout is a good compromise, but may introduce some inconsistent system 
behavior. Hence even if we decide to retain a timeout, I would prefer to 
document as `enforce a synchronous implementation under asynchronous interface`.

 

> Execution#producedPartitions is possibly not assigned when used
> ---------------------------------------------------------------
>
>                 Key: FLINK-14163
>                 URL: https://issues.apache.org/jira/browse/FLINK-14163
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Zhu Zhu
>            Assignee: Yuan Mei
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently {{Execution#producedPartitions}} is assigned after the partitions 
> have completed the registration to shuffle master in 
> {{Execution#registerProducedPartitions(...)}}.
> The partition registration is an async interface 
> ({{ShuffleMaster#registerPartitionWithProducer(...)}}), so 
> {{Execution#producedPartitions}} is possible[1] not set when used. 
> Usages includes:
> 1. deploying this task, so that the task may be deployed without its result 
> partitions assigned, and the job would hang. (DefaultScheduler issue only, 
> since legacy scheduler handled this case)
> 2. generating input descriptors for downstream tasks: 
> 3. retrieve {{ResultPartitionID}} for partition releasing: 
> [1] If a user uses Flink default shuffle master {{NettyShuffleMaster}}, it is 
> not problematic at the moment since it returns a completed future on 
> registration, so that it would be a synchronized process. However, if users 
> implement their own shuffle service in which the 
> {{ShuffleMaster#registerPartitionWithProducer}} returns an pending future, it 
> can be a problem. This is possible since customizable shuffle service is open 
> to users since 1.9 (via config "shuffle-service-factory.class").
> To avoid issues to happen, we may either 
> 1. fix all the usages of {{Execution#producedPartitions}} regarding the async 
> assigning, or 
> 2. change {{ShuffleMaster#registerPartitionWithProducer(...)}} to a sync 
> interface



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to