[ 
https://issues.apache.org/jira/browse/FLINK-9413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16558024#comment-16558024
 ] 

ASF GitHub Bot commented on FLINK-9413:
---------------------------------------

zhijiangW commented on issue #6103: [FLINK-9413] [distributed coordination] 
Tasks can fail with Partition…
URL: https://github.com/apache/flink/pull/6103#issuecomment-408008057
 
 
   We have also hit this exception in our large-scale applications, and we 
worked around it by increasing `taskmanager.network.request-backoff.max` at the 
cluster level.
   
   I agree with keeping the current default value in the code, because 
increasing it may slow down the unit tests or ITCases. Users who hit this 
exception can then adjust the config at the job or cluster level.
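
   To illustrate why raising the max helps, here is a minimal sketch of a 
doubling backoff that is capped at a maximum. It is not Flink's actual code: 
the class and method names are hypothetical, and it assumes the consumer 
retries with a backoff that doubles from an initial value up to the max and 
gives up once the max has been waited out. The values 100 ms and 10000 ms are 
used only as example inputs.

```java
// Hypothetical sketch, not Flink's implementation: estimates how long a
// consumer keeps retrying a partition request before it gives up, assuming
// the backoff doubles from an initial value and is capped at a maximum.
final class BackoffSketch {

    /**
     * Sums all backoff intervals that are waited before the retries stop.
     *
     * @param initialMillis first backoff interval, must be positive
     * @param maxMillis     cap for a single backoff interval, must be >= initialMillis
     */
    static long totalWaitMillis(long initialMillis, long maxMillis) {
        if (initialMillis <= 0 || maxMillis < initialMillis) {
            throw new IllegalArgumentException(
                "need 0 < initialMillis <= maxMillis");
        }
        long current = initialMillis;
        long total = current;
        // Keep doubling until the cap is reached; the capped interval is
        // waited once, after which the request is considered failed.
        while (current < maxMillis) {
            current = Math.min(current * 2, maxMillis);
            total += current;
        }
        return total;
    }

    public static void main(String[] args) {
        // Example inputs only; check the defaults of your Flink version.
        System.out.println(totalWaitMillis(100, 10_000));
        System.out.println(totalWaitMillis(100, 30_000));
    }
}
```

   Under these assumptions, a max of 10000 ms gives the producer roughly 22.7 
seconds to register its partition, while a max of 30000 ms gives it roughly 
81.1 seconds, which is why a larger max tolerates a slower deployment.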
   
   Maybe we could register the task with the network as early as possible once 
it starts running, for example by calling `registerTask` before the blob cache 
processing. That may avoid the unnecessary failover in most cases. :)



> Tasks can fail with PartitionNotFoundException if consumer deployment takes 
> too long
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-9413
>                 URL: https://issues.apache.org/jira/browse/FLINK-9413
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.4.0, 1.5.0, 1.6.0
>            Reporter: Till Rohrmann
>            Assignee: zhangminglei
>            Priority: Critical
>              Labels: pull-request-available
>
> {{Tasks}} can fail with a {{PartitionNotFoundException}} if the deployment of 
> the producer takes too long. More specifically, if it takes longer than the 
> {{taskmanager.network.request-backoff.max}}, then the {{Task}} will give up 
> and fail.
> The problem is that we calculate the {{InputGateDeploymentDescriptor}} for a 
> consuming task once the producer has been assigned a slot but we do not wait 
> until it is actually running. The problem should be fixed if we wait until 
> the task is in state {{RUNNING}} before assigning the result partition to the 
> consumer.



