[jira] [Commented] (BEAM-6777) SDK Harness Resilience

Oded Valtzer (JIRA) Sun, 11 Aug 2019 06:53:11 -0700


    [ 
https://issues.apache.org/jira/browse/BEAM-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16904657#comment-16904657
 ]


Oded Valtzer commented on BEAM-6777:
------------------------------------

Hey guys,
quick question on this: we are experiencing OOMs after more than one day of 
running the pipeline. We do intense CPU/memory computations in a single step 
of the pipeline, and at some point one or more workers hit OOM.
At that point the worker is killed by Windmill (we run Python 2.7 
streaming on Dataflow, Beam 2.14). 
Once the workers get into this state they never recover, and they end up in 
the state you describe in the description of this ticket. I couldn't work out 
the current status of the ticket; can you briefly explain?

Thanks for working on this
Oded

> SDK Harness Resilience
> ----------------------
>
>                 Key: BEAM-6777
>                 URL: https://issues.apache.org/jira/browse/BEAM-6777
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-dataflow
>            Reporter: Sam Rohde
>            Assignee: Yueyang Qiu
>            Priority: Major
>          Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> If the Python SDK Harness crashes in any way (user code exception, OOM, etc) 
> the job will hang and waste resources. The fix is to add a daemon in the SDK 
> Harness and Runner Harness to communicate with Dataflow to restart the VM 
> when stuckness is detected.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
