Till Rohrmann created FLINK-4711:
------------------------------------
Summary: TaskManager can crash due to failing
onPartitionStateUpdate call
Key: FLINK-4711
URL: https://issues.apache.org/jira/browse/FLINK-4711
Project: Flink
Issue Type: Bug
Components: Distributed Coordination
Affects Versions: 1.2.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
Fix For: 1.2.0
The {{TaskManager}} can crash because it calls {{Task.onPartitionStateUpdate}}
when it receives a {{PartitionState}} message. The {{onPartitionStateUpdate}}
method can throw an {{IOException}} or {{InterruptedException}} which are not
handled on the {{TaskManager}} level.
Another problem is that the initial partition state request is triggered within
the {{SingleInputGate}}. The request causes the {{JobManager}} to send a
{{PartitionState}} message to the {{TaskManager}} which forwards it to the
{{Task}}. If the at any of these points a message gets lost, then it is not
retried and the partition state remains unknown.
In order to handle the exceptions, to make the data flow clearer and to add
automatic retries, I propose to let the {{Task}} send the partition state check
requests. Furthermore, the {{JobManager}} should directly answer to the
{{Task}} by replying to an ask operation. That way the message does not have to
be routed through the {{TaskManager}}.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)