[ https://issues.apache.org/jira/browse/BEAM-6777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906964#comment-16906964 ]
Ankur Goenka edited comment on BEAM-6777 at 8/14/19 7:11 AM:
-------------------------------------------------------------

Thanks for providing the details. A pointer to the code would be useful.
However, as I don't have access to the production machines, I will not be able to see any of the logs.
It would be a good idea to open a ticket with the support team to get it resolved in a timely manner.

was (Author: angoenka):
Yes, that would be useful.

> SDK Harness Resilience
> ----------------------
>
> Key: BEAM-6777
> URL: https://issues.apache.org/jira/browse/BEAM-6777
> Project: Beam
> Issue Type: Improvement
> Components: runner-dataflow
> Reporter: Sam Rohde
> Assignee: Yueyang Qiu
> Priority: Major
> Time Spent: 7h 20m
> Remaining Estimate: 0h
>
> If the Python SDK Harness crashes in any way (user code exception, OOM, etc.)
> the job will hang and waste resources. The fix is to add a daemon in the SDK
> Harness and Runner Harness to communicate with Dataflow to restart the VM
> when stuckness is detected.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)