On Mon, Jan 24, 2022 at 8:53 AM Matthew Knepley <[email protected]> wrote:
> On Mon, Jan 24, 2022 at 9:24 AM Mark Adams <[email protected]> wrote:
>
>> What is the fastest way to rebuild hypre? reconfiguring did not work and
>> is slow.
>>
>> I am printf debugging to find this HSA_STATUS_ERROR_MEMORY_FAULT (no
>> debuggers other than valgrind on Crusher??!?!)
>>
>

Again, apologies for interjecting, but I wanted to offer a few pointers here.

1. `rocgdb` will be in your PATH when the `rocm` module is loaded. This is gdb, but with some extra AMDGPU goodies. AFAIK, you cannot yet step through a kernel at the source level (only at the ISA level), but you can query device variables in host code, print their values, etc.

1a. Note that the HIP runtime can spawn multiple threads. Furthermore, the thread you are on when you catch the error is likely (one of) the runtime thread(s). You'll need to run `info threads` and then select your host thread to get back to it.

2. To get an accurate stack trace (meaning the line in the host code where the error actually happens), I recommend setting the following environment variables while debugging. They force serialization of async memcopies and kernel launches:

AMD_SERIALIZE_KERNEL=3
AMD_SERIALIZE_COPY=3

Thanks,
Paul
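
For reference, the settings from point 2 could be applied in a shell session roughly as follows. This is only a sketch under stated assumptions: the helper name `debug_under_rocgdb` is made up for illustration, and it assumes `rocgdb` accepts gdb's usual `--args` option (it is gdb underneath, so this should hold).

```shell
# Ask the HIP runtime to serialize kernel launches and memory copies,
# so a fault is reported at the host call that actually triggered it.
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3

# Hypothetical helper (name is an assumption): start a program under rocgdb.
# Assumes the rocm module is already loaded so that rocgdb is in PATH.
debug_under_rocgdb() {
    rocgdb --args "$@"
}
```

A call would then look like `debug_under_rocgdb ./my_app`, where `./my_app` stands in for your own executable. Once the fault is caught, `info threads` lists the threads so you can switch back to the host thread, as described in point 1a.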
