[ 
https://issues.apache.org/jira/browse/FLINK-14163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17012507#comment-17012507
 ] 

Yuan Mei commented on FLINK-14163:
----------------------------------

Hey, [~chesnay], [~zhuzh], [~azagrebin] and [~gjy], thank you so much for the 
summary! It is thorough and clear. Learned a lot :)

 

The fundamental problem is the assignment and access of 
`Execution#producedPartitions` mismatches. That is, assignment is through an 
asynchronous interface, but not all of the accesses are using callbacks. That 
is the reason why misuses are introduced in the first place.

 

Given the only implementation of `ShuffleMaster` is synchronous 
(`NettyShuffleMaster`), there is no immediate needs to clean up all the usage. 
Instead, the more important thing is to eliminate the risk to introduce more 
misuses or lead the system behave in an unexpected way.

 

Along with the discussion, I think everyone agrees to keep the async interface 
of `ShuffleMaster#registerPartitionWithProducer` for the purpose of 
flexibility, but enforce a synchronous registration underneath. Then the 
question is what is the best place to put the enforcement.

 

[In my previous 
comment|https://issues.apache.org/jira/browse/FLINK-14163?focusedCommentId=17011568&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17011568],
  

Option1 changes all access to `Execution#producedPartitions` as a callback, 
hence the interfaces for both assignment and access are consistent. The 
synchronous registration is enforced and wrapped before accessing.

But since there are different ways to access producedPartitions, Option1 has 
some limitations (discussed in the comment). This is also the reason I am a bit 
hesitate to fix the usage (access) of producedPartitions directly.

 

Option2 is a refinement of Option1 but a bit ugly.

 

Option3 enforces the synchronous registration from the registration side, and 
immediately fails if the registration is not synchronous. This is a much 
simpler and safe change for the purpose of  `eliminating the risk to introduce 
more misuses and to lead the system behave in an unexpected way.`

 

If asynchronous registration is needed in the future, we should fix the access 
part of course.

 

> Execution#producedPartitions is possibly not assigned when used
> ---------------------------------------------------------------
>
>                 Key: FLINK-14163
>                 URL: https://issues.apache.org/jira/browse/FLINK-14163
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Zhu Zhu
>            Assignee: Yuan Mei
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Currently {{Execution#producedPartitions}} is assigned after the partitions 
> have completed the registration to shuffle master in 
> {{Execution#registerProducedPartitions(...)}}.
> The partition registration is an async interface 
> ({{ShuffleMaster#registerPartitionWithProducer(...)}}), so 
> {{Execution#producedPartitions}} is possible[1] not set when used. 
> Usages includes:
> 1. deploying this task, so that the task may be deployed without its result 
> partitions assigned, and the job would hang. (DefaultScheduler issue only, 
> since legacy scheduler handled this case)
> 2. generating input descriptors for downstream tasks: 
> 3. retrieve {{ResultPartitionID}} for partition releasing: 
> [1] If a user uses Flink default shuffle master {{NettyShuffleMaster}}, it is 
> not problematic at the moment since it returns a completed future on 
> registration, so that it would be a synchronized process. However, if users 
> implement their own shuffle service in which the 
> {{ShuffleMaster#registerPartitionWithProducer}} returns an pending future, it 
> can be a problem. This is possible since customizable shuffle service is open 
> to users since 1.9 (via config "shuffle-service-factory.class").
> To avoid issues to happen, we may either 
> 1. fix all the usages of {{Execution#producedPartitions}} regarding the async 
> assigning, or 
> 2. change {{ShuffleMaster#registerPartitionWithProducer(...)}} to a sync 
> interface



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to