[
https://issues.apache.org/jira/browse/BEAM-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906528#comment-16906528
]
Ankur Goenka commented on BEAM-6777:
------------------------------------
The task is intended to avoid the pipeline getting stuck without any visible
errors; it does not deal with pipeline failures.
I am not sure whether you are using batch or streaming, but for streaming we
simply kill the workers and do not terminate the pipeline, while for batch we
terminate the pipeline after a set number of retries.
There are a couple of PRs linked in this JIRA for the pipeline-stuck issue.
For this particular OOM case, we need more debugging, as even the current fix
would not mitigate the OOM; it would only surface the problem instead of the
pipeline silently getting stuck.
Can you check the memory usage of the different processes by logging into the
corresponding VM before it goes OOM?
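For reference, a minimal sketch of how one might log per-process memory on the
worker VM while the pipeline runs; this is not part of Beam and assumes psutil
is installed on the VM (`pip install psutil`):

{code:python}
# Hypothetical helper (not part of Beam): periodically print the top-N
# processes by resident memory (RSS) so the growing process can be spotted
# before the VM goes OOM.
import time

import psutil


def log_memory_usage(top_n=10, interval_s=30):
    """Print the top-N processes by RSS every interval_s seconds."""
    while True:
        procs = []
        for proc in psutil.process_iter(["pid", "name", "memory_info"]):
            try:
                procs.append(
                    (proc.info["memory_info"].rss, proc.info["pid"], proc.info["name"])
                )
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                continue
        procs.sort(reverse=True)
        print(time.strftime("%Y-%m-%d %H:%M:%S"))
        for rss, pid, name in procs[:top_n]:
            print("  pid=%d name=%s rss=%.1f MiB" % (pid, name, rss / (1024.0 * 1024.0)))
        time.sleep(interval_s)


if __name__ == "__main__":
    log_memory_usage()
{code}

Running this in a separate SSH session on the VM should make it clear whether
the SDK harness, the runner harness, or user code is the process that grows.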
> SDK Harness Resilience
> ----------------------
>
> Key: BEAM-6777
> URL: https://issues.apache.org/jira/browse/BEAM-6777
> Project: Beam
> Issue Type: Improvement
> Components: runner-dataflow
> Reporter: Sam Rohde
> Assignee: Yueyang Qiu
> Priority: Major
> Time Spent: 7h 20m
> Remaining Estimate: 0h
>
> If the Python SDK Harness crashes in any way (user code exception, OOM, etc.),
> the job will hang and waste resources. The fix is to add a daemon in the SDK
> Harness and Runner Harness to communicate with Dataflow to restart the VM
> when stuckness is detected.