On Mon, Jan 24, 2022 at 9:31 AM Mark Adams <[email protected]> wrote:

> Thanks Paul,
>
> How do I get a stack trace? I have been relying on PETSc's stack trace,
> which piggybacks on timers, so it does not go very deep here.
>

I'm not sure what the "PETSc way" is, but I just run the executable through
`rocgdb` as one would with `gdb`. (`rocgdb` is literally `gdb` built with
extra AMD support, which is either upstreamed or being upstreamed to gdb.)
You can also run it in batch mode, so you can dump the logs from each MPI
process.
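
As a rough sketch of the batch-mode approach, a small Python wrapper could
launch each rank under `rocgdb` and write one log per rank. The helper names,
the log file naming, and the list of rank environment variables
(`SLURM_PROCID`, `OMPI_COMM_WORLD_RANK`, `PMI_RANK`) are assumptions for
illustration, not something PETSc or ROCm provides; `-batch`, `-ex`, and
`--args` are standard `gdb` options.

```python
import os
import subprocess


def rank_from_env(env):
    """Return the MPI rank as a string, or "0" if no known variable is set."""
    # Assumed launcher variables: Slurm, Open MPI, and PMI-based launchers.
    for name in ("SLURM_PROCID", "OMPI_COMM_WORLD_RANK", "PMI_RANK"):
        if name in env:
            return env[name]
    return "0"


def log_name(env):
    """Hypothetical per-rank log file name."""
    return f"rocgdb.rank{rank_from_env(env)}.log"


def batch_command(program_args, debugger="rocgdb"):
    """Build a non-interactive debugger command line for one process."""
    return [
        debugger,
        "-batch",                      # quit once the -ex commands are done
        "-ex", "run",                  # start the program under the debugger
        "-ex", "thread apply all bt",  # backtrace of every thread after a stop
        "--args", *program_args,       # everything after this is the program
    ]


def run_under_debugger(program_args, env=None):
    """Run the program under the debugger and capture its output in a log."""
    env = dict(os.environ if env is None else env)
    with open(log_name(env), "w") as log:
        result = subprocess.run(
            batch_command(program_args),
            stdout=log,
            stderr=subprocess.STDOUT,
            env=env,
        )
    return result.returncode
```

Wrapped in a hypothetical `rank_gdb.py` that calls
`run_under_debugger(sys.argv[1:])`, this might be launched as
`srun -n 8 python3 rank_gdb.py ./app <app args>`, leaving one
`rocgdb.rank<N>.log` per process.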


>
> On Mon, Jan 24, 2022 at 10:16 AM Paul T. Bauman <[email protected]>
> wrote:
>
>> On Mon, Jan 24, 2022 at 8:53 AM Matthew Knepley <[email protected]>
>> wrote:
>>
>>> On Mon, Jan 24, 2022 at 9:24 AM Mark Adams <[email protected]> wrote:
>>>
>>>> What is the fastest way to rebuild hypre? Reconfiguring did not work
>>>> and is slow.
>>>>
>>>> I am printf debugging to find this HSA_STATUS_ERROR_MEMORY_FAULT (no
>>>> debuggers other than valgrind on Crusher??!?!)
>>>>
>>>
>> Again, apologies for interjecting, but I wanted to offer a few pointers
>> here.
>>
>> 1. `rocgdb` will be in your PATH when the `rocm` module is loaded. This
>> is gdb, but with some extra AMDGPU goodies. AFAIK, you cannot yet step
>> through a kernel at the source level (only at the ISA level), but you can
>> query device variables in host code, print their values, etc.
>> 1a. Note that multiple threads can be spawned by the HIP runtime.
>> Furthermore, it's likely the thread you'll be on when you catch the error
>> is (one of) the runtime thread(s). You'll need to do `info threads` and
>> then select your host thread to get back to it.
>> 2. To get an accurate stack trace (meaning the line in the host code
>> where the error actually happens), I recommend setting the following
>> environment variables while debugging. They force the serialization of
>> async memcopies and kernel launches:
>> AMD_SERIALIZE_KERNEL=3
>> AMD_SERIALIZE_COPY=3
>>
>> Thanks,
>>
>> Paul
>>
>
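
The serialization settings quoted above could also be applied from a small
launcher instead of being exported by hand. This is only a sketch: the
variable names and values are copied from the quoted message, while the
function names are made up for illustration.

```python
import os
import subprocess

# Names and values copied from the quoted message; intended for debugging
# runs only, since they serialize async copies and kernel launches.
SERIALIZE_VARS = {
    "AMD_SERIALIZE_KERNEL": "3",
    "AMD_SERIALIZE_COPY": "3",
}


def debug_env(base_env):
    """Return a copy of base_env with the serialization variables set."""
    env = dict(base_env)        # copy, so the caller's mapping is untouched
    env.update(SERIALIZE_VARS)
    return env


def run_serialized(command, base_env=None):
    """Run command (a list of strings) with the serialization variables set."""
    base = os.environ if base_env is None else base_env
    return subprocess.run(command, env=debug_env(base)).returncode
```

For example, `run_serialized(["rocgdb", "--args", "./app"])` would start the
debugger with both variables set for that one process tree, without changing
the surrounding shell; `./app` is a placeholder for the real executable.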
