Thanks, Paul. How do I get a stack trace? I have been relying on PETSc's, which piggybacks on its timers, so it does not go very deep here.

On Mon, Jan 24, 2022 at 10:16 AM Paul T. Bauman <ptbau...@gmail.com> wrote:

> On Mon, Jan 24, 2022 at 8:53 AM Matthew Knepley <knep...@gmail.com> wrote:
>
>> On Mon, Jan 24, 2022 at 9:24 AM Mark Adams <mfad...@lbl.gov> wrote:
>>
>>> What is the fastest way to rebuild hypre? reconfiguring did not work and
>>> is slow.
>>>
>>> I am printf debugging to find this HSA_STATUS_ERROR_MEMORY_FAULT (no
>>> debuggers other than valgrind on Crusher??!?!)
>>>
>>
> Again, apologies for interjecting, but I wanted to offer a few pointers
> here.
>
> 1. `rocgdb` will be in your PATH when the `rocm` module is loaded. This is
> gdb, but with some extra AMDGPU goodies. AFAIK, you cannot, yet, do
> stepping through a kernel in the source (only the ISA), but you can query
> device variables in host code, print their values, etc.
> 1a. Note that multiple threads can be spawned by the HIP runtime.
> Furthermore, it's likely the thread you'll be on when you catch the error
> is (one of) the runtime thread(s). You'll need to do `info threads` and
> then select your host thread to get back to it.
> 2. To get an accurate stacktrace (meaning get the line in the host code
> where the error is actually happening), I recommend setting the following
> environment variables for debugging that will force the serialization of
> async memcopies and kernel launches:
> AMD_SERIALIZE_KERNEL=3
> AMD_SERIALIZE_COPY=3
>
> Thanks,
>
> Paul
>
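
For reference, Paul's points 1 and 2 could be combined roughly as in the sketch below. It is only an illustration under assumptions: `./my_app` is a hypothetical placeholder for the actual executable, and it assumes `rocgdb` accepts the usual gdb options (`-batch`, `-ex`, `--args`), which should hold since it is gdb underneath.

```shell
# Sketch only: ./my_app is a hypothetical placeholder for your executable.

# Serialize kernel launches and async copies so that a fault is reported
# at the host call that actually triggered it (values from the thread above).
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3

# rocgdb should be on PATH once the rocm module is loaded. The guard keeps
# this script harmless on machines where it (or the executable) is missing.
if command -v rocgdb >/dev/null 2>&1 && [ -x ./my_app ]; then
    # Run to the fault, list all threads (HIP runtime threads included),
    # then print a backtrace for every thread so the host thread is covered.
    rocgdb -batch \
        -ex run \
        -ex "info threads" \
        -ex "thread apply all bt" \
        --args ./my_app
else
    echo "rocgdb or ./my_app not found; environment prepared only"
fi
```

In an interactive session the equivalent would be `run`, then `info threads`, then `thread N` with the host thread's number, then `bt`.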