akashorabek commented on PR #34502: URL: https://github.com/apache/beam/pull/34502#issuecomment-2770711742
> > After investigating the issue, it turns out that sometimes when a failure occurs due to OOM, the worker shuts down immediately and doesn't reach [the part of the code in boot.go](https://github.com/apache/beam/blob/ad7729e05041fc333ae447b5500149dffcf8336d/sdks/go/container/boot.go#L200) responsible for generating the dump file. Attempts to add timeouts before and after reading the file, preallocate additional memory in boot.go, and use parameters like dumpHeapOnOom and saveHeapDumpsToGcsPath didn’t help. Temporarily disabled this test so that the PostCommit Go Dataflow ARM and the PostCommit Go workflows pass successfully. Created a [separate issue](https://github.com/apache/beam/issues/34498) for further investigation.
>
> Thanks for looking into this - what frequency does this fail at? We can merge this, but I'm curious to know the impact/how often this does/doesn't work.

These workflows fail around 60-70% of the time due to this error. Interestingly, the failures started occurring around March 19–20, and rolling back to previous PRs didn’t help, so a change on the GCP side may be the cause.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@beam.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org