akashorabek commented on PR #34502:
URL: https://github.com/apache/beam/pull/34502#issuecomment-2770711742

   > > After investigating the issue, it turns out that sometimes when a 
failure occurs due to OOM, the worker shuts down immediately and doesn't reach 
[the part of the code in 
boot.go](https://github.com/apache/beam/blob/ad7729e05041fc333ae447b5500149dffcf8336d/sdks/go/container/boot.go#L200)
 responsible for generating the dump file. Attempts to add timeouts before and 
after reading the file, preallocate additional memory in boot.go, and use 
parameters like dumpHeapOnOom and saveHeapDumpsToGcsPath didn’t help. 
Temporarily disabled this test so that the PostCommit Go Dataflow ARM and the 
PostCommit Go workflows pass successfully. Created a [separate 
issue](https://github.com/apache/beam/issues/34498) for further investigation.
   > 
   > Thanks for looking into this - how often does this fail? We can 
merge this, but I'm curious to know the impact.
   
   These workflows fail around 60-70% of the time due to this error. 
Interestingly, the failures started occurring around March 19–20, and rolling 
back to previous PRs didn’t help, so it’s possible that some changes on the GCP 
side might be the cause.


