[
https://issues.apache.org/jira/browse/FLINK-21376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-21376:
-----------------------------------
Labels: stale-major (was: )
I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help
the community manage its development. I see this issue has been marked as
Major but is unassigned, and neither it nor its Sub-Tasks have been updated
for 30 days. I have gone ahead and added the "stale-major" label to the issue.
If this ticket is a Major, please either assign yourself or give an update.
Afterwards, please remove the label, or in 7 days the issue will be deprioritized.
> Failed state might not provide failureCause
> -------------------------------------------
>
> Key: FLINK-21376
> URL: https://issues.apache.org/jira/browse/FLINK-21376
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.11.3, 1.12.1, 1.13.0
> Reporter: Matthias
> Priority: Major
> Labels: stale-major
>
> {{Task.executionState}} and {{Task.failureCause}} are not set atomically.
> This became an issue when implementing the exception history (FLINK-21187)
> where we relied on the invariant that a {{failureCause}} is present when the
> {{Task}} failed.
> Adding a check for this invariant to
> [Task.notifyFinalState()|https://github.com/apache/flink/blob/9b6f076a66970d3d3ef710f8d5ee66d75d87eba5/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1001]
> reveals the race condition.
> {{TaskExecutorSlotLifetimeTest}} becomes unstable when adding this invariant.
> The reason is that the test starts a task but does not wait for the task to
> be finished. The [task
> finalization|https://github.com/apache/flink/blob/9b6f076a66970d3d3ef710f8d5ee66d75d87eba5/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L895]
> and [the cancellation of the
> task|https://github.com/apache/flink/blob/9b6f076a66970d3d3ef710f8d5ee66d75d87eba5/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L1105]
> triggered by the {{TaskManager}} shutdown race with each
> other and can leave the {{executionState}} set to {{FAILED}} while
> the {{failureCause}} is still {{null}}. This state is then forwarded to
> {{Execution}} through
> [Task.notifyFinalState|https://github.com/apache/flink/blob/9b6f076a66970d3d3ef710f8d5ee66d75d87eba5/flink-runtime/src/main/java/org/apache/flink/runtime/taskmanager/Task.java#L895].
> We should set {{failureCause}} atomically with the transition of
> {{executionState}} to {{FAILED}} so that no caught error is missed.
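
One possible way to keep the two values consistent is sketched below. This is not the actual {{Task}} implementation and not necessarily the fix that was applied in Flink. It only illustrates the idea of publishing the state and its cause together through a single compare-and-set. The names AtomicFailureStateSketch, StateSnapshot, transition and snapshot are hypothetical, and the enum is a reduced stand-in for Flink's ExecutionState.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: state and failure cause live in one immutable object,
// so a reader can never observe FAILED together with a null cause.
class AtomicFailureStateSketch {

    // Reduced stand-in for Flink's ExecutionState (assumption for this sketch).
    enum ExecutionState { RUNNING, CANCELING, CANCELED, FAILED, FINISHED }

    static final class StateSnapshot {
        final ExecutionState state;
        final Throwable failureCause;

        StateSnapshot(ExecutionState state, Throwable failureCause) {
            // The invariant from the issue: FAILED implies a non-null cause.
            if (state == ExecutionState.FAILED && failureCause == null) {
                throw new IllegalArgumentException("FAILED requires a failureCause");
            }
            this.state = state;
            this.failureCause = failureCause;
        }
    }

    private final AtomicReference<StateSnapshot> current =
            new AtomicReference<>(new StateSnapshot(ExecutionState.RUNNING, null));

    // Moves from 'expected' to 'next' and publishes 'cause' in the same CAS.
    // Returns false if another thread changed the state first.
    boolean transition(ExecutionState expected, ExecutionState next, Throwable cause) {
        StateSnapshot seen = current.get();
        if (seen.state != expected) {
            return false;
        }
        return current.compareAndSet(seen, new StateSnapshot(next, cause));
    }

    StateSnapshot snapshot() {
        return current.get();
    }

    public static void main(String[] args) {
        AtomicFailureStateSketch task = new AtomicFailureStateSketch();
        // A failure and a cancellation compete; only the first transition wins.
        boolean failed = task.transition(
                ExecutionState.RUNNING, ExecutionState.FAILED, new RuntimeException("boom"));
        boolean canceled = task.transition(
                ExecutionState.RUNNING, ExecutionState.CANCELED, null);
        StateSnapshot s = task.snapshot();
        System.out.println(failed + " " + canceled + " " + s.state + " " + s.failureCause.getMessage());
    }
}
```

Because both values are read from one snapshot, a notification sent to {{Execution}} would carry a matching state and cause. The trade-off is one small allocation per transition.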