Thanks for opening this PR @TisonKun. 

Before diving into the details of this PR I'd like to know whether you've 
observed that the JM crashes or is this more of theoretical nature? If it does 
crash indeed, then I would be interested to learn why, because the 
`requestPartitionState` method should not be blocking at all. How many 
`requestPartitionState` messages are in generated in the crash case?

Another question is concerning your assumptions: You said that 
`retriggerPartitionRequest` would fail if the producer is gone. With producer 
do you mean the producing `Task` or the `TaskManager`. In the former case, I 
think the remote `TaskManager` would simply respond with a 
`PartitionNotFoundException` which retriggers the same partition request method 
again. Thus, I'm not quite sure whether the consumer task would actually fail 
or simply retry infinitely. The latter result is imo what we try to prevent 
with asking the JM about the state of the result partition.

[ Full content available at: https://github.com/apache/flink/pull/6680 ]
This message was relayed via gitbox.apache.org for [email protected]

Reply via email to