[ 
https://issues.apache.org/jira/browse/FLINK-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014085#comment-17014085
 ] 

zhijiang commented on FLINK-14163:
----------------------------------

Thanks for the good suggestions above! Sorry for coming back to this issue a 
bit late, especially since the PR is already ready.

My earlier guess was that formally supporting the async way would cause big 
trouble for the scheduler, or might conflict with the long-term direction of 
the new scheduler. I also considered the async shuffle way somewhat 
over-designed at the time, with no real users atm, so I mentioned before that I 
could accept switching to the sync way to stop the loss early, although in 
general I also think it is not good to break compatibility of an exposed public 
interface. If handling the async way will not be a problem for the scheduler in 
the future, I am happy to retain the async shuffle way.

If we decide to retain the async way and work around it in the scheduler 
temporarily, it might be better not to fail directly after finding that the 
future is not completed. I mean we could go a step further and tolerate a 
timeout before failing. This timeout would cover not only waiting for the 
future to complete, but also waiting for the shuffle master call to return the 
future, so that the main thread does not get stuck for a long time.
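
To make the timeout idea concrete, here is a minimal sketch in plain Java. It 
is not Flink code: the class RegistrationTimeoutSketch, the method 
registerWithTimeout and the Supplier parameter are hypothetical stand-ins for 
the scheduler calling the shuffle master. It assumes the call is run on 
another thread (here the common pool) so that one timeout bounds both the call 
returning its future and that future completing. Note that the caller still 
blocks for at most the timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical sketch, not part of Flink.
final class RegistrationTimeoutSketch {

    /**
     * Invokes the (possibly slow) registration call off the calling thread and
     * waits at most timeoutMillis for the call to return AND for the returned
     * future to complete. Fails only after the timeout has elapsed.
     */
    static <T> T registerWithTimeout(
            Supplier<CompletableFuture<T>> registrationCall, long timeoutMillis) {
        // supplyAsync yields a future of a future; thenCompose flattens it.
        CompletableFuture<T> overall =
                CompletableFuture.supplyAsync(registrationCall)
                        .thenCompose(future -> future);
        try {
            return overall.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException(
                    "Partition registration did not complete within "
                            + timeoutMillis + " ms", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(
                    "Interrupted while waiting for partition registration", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(
                    "Partition registration failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // Already completed future: behaves like the sync case.
        String descriptor = registerWithTimeout(
                () -> CompletableFuture.completedFuture("partition-1"), 1000);
        System.out.println(descriptor);

        // Future that never completes: fails only after the timeout.
        try {
            registerWithTimeout(() -> new CompletableFuture<String>(), 100);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```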

> Execution#producedPartitions is possibly not assigned when used
> ---------------------------------------------------------------
>
>                 Key: FLINK-14163
>                 URL: https://issues.apache.org/jira/browse/FLINK-14163
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Zhu Zhu
>            Assignee: Yuan Mei
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently {{Execution#producedPartitions}} is assigned after the partitions 
> have completed the registration to shuffle master in 
> {{Execution#registerProducedPartitions(...)}}.
> The partition registration is an async interface 
> ({{ShuffleMaster#registerPartitionWithProducer(...)}}), so 
> {{Execution#producedPartitions}} is possibly[1] not set when used. 
> Usages include:
> 1. deploying this task, so that the task may be deployed without its result 
> partitions assigned, and the job would hang. (DefaultScheduler issue only, 
> since legacy scheduler handled this case)
> 2. generating input descriptors for downstream tasks: 
> 3. retrieving {{ResultPartitionID}} for partition releasing: 
> [1] If a user uses Flink's default shuffle master {{NettyShuffleMaster}}, it 
> is not problematic at the moment since it returns a completed future on 
> registration, so registration is effectively synchronous. However, if users 
> implement their own shuffle service in which 
> {{ShuffleMaster#registerPartitionWithProducer}} returns a pending future, it 
> can be a problem. This is possible since the customizable shuffle service has 
> been open to users since 1.9 (via config "shuffle-service-factory.class").
> To avoid these issues, we may either 
> 1. fix all the usages of {{Execution#producedPartitions}} regarding the async 
> assigning, or 
> 2. change {{ShuffleMaster#registerPartitionWithProducer(...)}} to a sync 
> interface
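
The difference between a completed and a pending future described in the 
issue can be illustrated with a small plain-Java model. It is only a sketch: 
ExecutionModel and its String field are hypothetical stand-ins, not the real 
Execution class, and the only assumption is that the field is assigned in a 
callback on the registration future.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical model, not part of Flink.
final class AsyncAssignmentSketch {

    /** Stand-in for an Execution; only the field assignment is modeled. */
    static final class ExecutionModel {
        // Stays null until the registration future completes.
        private volatile String producedPartitions;

        void registerProducedPartitions(CompletableFuture<String> registration) {
            // The field is assigned only inside the completion callback.
            registration.thenAccept(descriptor -> this.producedPartitions = descriptor);
        }

        String getProducedPartitions() {
            return producedPartitions;
        }
    }

    public static void main(String[] args) {
        // Completed future: the callback runs immediately, so the field is set.
        ExecutionModel sync = new ExecutionModel();
        sync.registerProducedPartitions(CompletableFuture.completedFuture("partition-1"));
        System.out.println(sync.getProducedPartitions()); // prints partition-1

        // Pending future: the field is still unset when it is read.
        CompletableFuture<String> pending = new CompletableFuture<>();
        ExecutionModel async = new ExecutionModel();
        async.registerProducedPartitions(pending);
        System.out.println(async.getProducedPartitions()); // prints null

        // Only after completion does the field become visible.
        pending.complete("partition-2");
        System.out.println(async.getProducedPartitions()); // prints partition-2
    }
}
```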



--
This message was sent by Atlassian Jira
(v8.3.4#803005)