damccorm commented on issue #21797: URL: https://github.com/apache/beam/issues/21797#issuecomment-1179337648
As I currently understand it, using `GOTRACEBACK=crash` is going to be technically impossible. The location of the core dump depends on the underlying kernel, and we don't have the ability to change that (the container is mounted read-only, and that type of manipulation is almost definitely a no-go in pretty much every context).

A good example of this being hard is Dataflow. On Dataflow, the contents of `/proc/sys/kernel/core_pattern` are currently `/sbin/crash_reporter --user=%P:%s:%u:%g:%f`, which means that any core dumps will be written to the crash reporter (a Chrome concept) if one exists. I'm not even sure that will happen (e.g. can we write to that from the container? Probably not), but regardless we won't be able to read it.

Even if we were able to modify the core_pattern file while building our container, I don't think that approach can be extended to arbitrary runners, since it relies on the behavior of that particular operating system. We'd need to have intimate knowledge of the OS/kernel/configuration that we're running on top of _and_ be resilient to any changes there (ideally without requiring an SDK upgrade, since we shouldn't be coupled to the version of hardware we're running on).

---------------------------------------------------------------------------------

That leaves a few options as I see it:

1) We could take a heap dump after the process crashes. I'm not sure whether that has utility; I'd have to examine whether any useful info is still on the heap or whether it's totally wiped on process completion.
2) We could upload a heap dump every few minutes (this is probably bad: it's expensive and won't necessarily catch the problem).
3) We could try to build out something like Java has: monitor _something_ and upload a heap dump when that happens.
4) We could simply not do this.

In theory (1) or (3) is the best option, but I have no clue what we would actually monitor for (3).
Java monitors GC thrashing, which doesn't really make sense for Go. I don't know of any compelling signals to use here.

My next steps here:

1. Investigate whether a heap dump taken after process completion offers any value.
2. If that doesn't uncover _significant_ value, consider ways we could do option (3). I'm not super hopeful here.
3. Based on the results of the investigation, do one of options (1), (3), or (4).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
