damccorm commented on issue #21797:
URL: https://github.com/apache/beam/issues/21797#issuecomment-1179337648

   As I currently understand it, using `GOTRACEBACK=crash` to get usable core 
dumps is going to be technically impossible. The location of the core dump 
depends on the underlying kernel, and we don't have the ability to change that 
(the container is mounted read-only, and that type of manipulation is almost 
certainly a no-go in pretty much every context).
   
   A good example of why this is hard is Dataflow. On Dataflow, the contents of 
`/proc/sys/kernel/core_pattern` currently are `/sbin/crash_reporter 
--user=%P:%s:%u:%g:%f`, which means that any core dump will be handed to the 
crash reporter (a Chrome concept) if one exists. I'm not even sure that will 
happen (e.g. can we write to it from the container? Probably not), but 
regardless, we won't be able to read the result. Even if we were able to modify 
the core_pattern file while building our container, I don't think that approach 
can be extended to arbitrary runners, since it relies on the behavior of one 
particular operating system. We'd need to have intimate knowledge of the 
OS/kernel/configuration that we're running on top of _and_ be resilient to any 
changes there (ideally without requiring an SDK upgrade; we shouldn't be 
coupled to the version of hardware we're running on).
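
   To make the host dependence concrete, here is a minimal sketch of how a Go 
process could inspect where the kernel would send a core dump. It assumes a 
Linux host with procfs mounted and follows the rule documented in core(5) that a 
leading `|` pipes the dump to a program. The helper name `classifyCorePattern` 
is hypothetical and not part of Beam.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// corePatternPath is the standard Linux procfs location of the core pattern.
const corePatternPath = "/proc/sys/kernel/core_pattern"

// classifyCorePattern is a hypothetical helper that describes, in broad
// terms, where the kernel would send a core dump for the given pattern.
func classifyCorePattern(pattern string) string {
	p := strings.TrimSpace(pattern)
	switch {
	case p == "":
		return "empty pattern"
	case strings.HasPrefix(p, "|"):
		return "piped to a helper program"
	case strings.HasPrefix(p, "/"):
		return "written to an absolute path"
	default:
		return "written relative to the crashing process's working directory"
	}
}

func main() {
	raw, err := os.ReadFile(corePatternPath)
	if err != nil {
		// Expected on non-Linux hosts or in locked-down containers.
		fmt.Printf("cannot read %s: %v\n", corePatternPath, err)
		return
	}
	pattern := strings.TrimSpace(string(raw))
	fmt.Printf("core_pattern %q: %s\n", pattern, classifyCorePattern(pattern))
}
```

   Reading the file only tells us what the host is configured to do. It does not 
give the SDK any way to change it or to retrieve the dump afterwards.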
   
   
---------------------------------------------------------------------------------
   
   That leaves a few options as I see it:
   
   1) We could take a heap dump after the process crashes. I'm not sure whether 
that has utility; I'd have to examine whether any useful info is still on the 
heap or whether it's totally wiped on process completion.
   2) We could upload a heap dump every few minutes (this is probably bad: it's 
expensive and won't necessarily catch the problem).
   3) We could try to build out something like what Java has: monitor 
_something_ and upload a heap dump when that signal fires.
   4) We could simply not do this.
   
   In theory (1) or (3) is the best option, but I have no clue what we would 
actually monitor for (3). Java monitors GC thrashing, which doesn't really make 
sense for Go. I don't know of any compelling signals to use here.
   
   My next steps here:
   
   1. Investigate whether a heap dump taken after process completion offers any 
value.
   2. If that doesn't uncover _significant_ value, consider ways we could do 
option (3). I'm not super hopeful here.
   3. Based on the results of the investigation, pursue one of options (1), (3), 
or (4).


