On 2024-02-26 07:23, Cordell Bloor wrote:
> The update of rocsparse 5.5.1 to 5.7.1 seems to have caused a regression
> in hipsparse. Although it's also possible that this problem arose because
> rocsparse was consequently rebuilt against the updated rocprim 5.7.1.

This one looks a bit tricky as it also seems to be GPU-arch dependent.

The following is all going by the logs of gfx1034 [1]:

One issue is that the test suite is often aborted early, so the csr2bsr
tests (from this bug) don't even get run. From the tail of the latest
log in experimental [2]:

>  82s [ RUN      ] dense2csr/parameterized_dense2csr.dense2csr_float/158
>  82s [       OK ] dense2csr/parameterized_dense2csr.dense2csr_float/158 (0 ms)
>  82s [ RUN      ] dense2csr/parameterized_dense2csr.dense2csr_float/159
>  82s [       OK ] dense2csr/parameterized_dense2csr.dense2csr_float/159 (0 ms)
>  82s [ RUN      ] dense2csr/parameterized_dense2csr.dense2csr_float/160
>  82s hipSPARSE error: HIPSPARSE_STATUS_INTERNAL_ERROR
>  82s hipSPARSE error: HIPSPARSE_STATUS_INTERNAL_ERROR
>  82s double free or corruption (out)
>  82s Aborted
>  82s autopkgtest [00:40:28]: test command1: -----------------------]
>  83s command1             FAIL non-zero exit status 1
>  83s autopkgtest [00:40:29]: test command1:  - - - - - - - - - - results - - - - - - - - - -
>  84s autopkgtest [00:40:30]: @@@@@@@@@@@@@@@@@@@@ summary
>  84s command1             FAIL non-zero exit status 1
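
(One way to check the csr2bsr tests independently of the early abort would
be to run the test binary with a googletest filter. This is only a sketch,
assuming the autopkgtest builds the standard googletest client seen in the
log and that it accepts the usual gtest flags:

  ./hipsparse-test --gtest_list_tests | grep -i csr2bsr
  ./hipsparse-test --gtest_filter='*csr2bsr*'

The first command lists the matching cases, the second runs only those, so
they could be exercised even while the full suite keeps aborting earlier
in dense2csr.)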

The last fully completed run (no abort) was on 2024-01-27 [3], with a
runtime of 3m38s. Then the rocsparse-5.7.1 upgrade happened, and indeed no
successful completion can be seen after that update, which suggests it
might be a factor.

But within the preceding 24 hours, two other test runs [4,5] also aborted
early -- still with rocsparse=5.5.1-2.

Some of the earlier logs have more informative error messages prior to
the abort, e.g.:

>  85s Memory access fault by GPU node-1 (Agent handle: 0x5608f9e506c0) on address 0x7fa318408000. Reason: Page not present or supervisor privilege.
>  85s Nearby memory map:
> [...]
>  85s hipsparse-test: ./src/core/runtime/runtime.cpp:1276: static bool rocr::core::Runtime::VMFaultHandler(hsa_signal_value_t, void*): Assertion `false && "GPU memory access fault."' failed.
>  85s ./clients/common/unit.cpp:128: Failure

I think these tests were all run in VMs; I'll try to reproduce them on
bare metal just to be sure.
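
If it does reproduce on bare metal, it might also be worth running the
binary under rocgdb or with kernel serialization enabled (assuming I'm
remembering the HIP debug variable correctly), e.g.:

  AMD_SERIALIZE_KERNEL=3 ./hipsparse-test

so that each kernel launch completes before the next one starts and the
memory fault gets attributed to the right kernel. Just an idea; I haven't
tried it on these machines.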

And to make things even more interesting, the test history of gfx1030
[6] suggests that rocsparse was indeed a factor: tests on gfx1030 passed
until rocsparse=5.7.1-2, then failed, and now pass again with
hipsparse=5.7.1-1~exp1.

Best,
Christian

[1]: https://ci.rocm.debian.net/packages/h/hipsparse/unstable/amd64+gfx1034/
[2]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1034/h/hipsparse/7767/log.gz
[3]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1034/h/hipsparse/5234/log.gz
[4]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1034/h/hipsparse/5075/log.gz
[5]: https://ci.rocm.debian.net/data/autopkgtest/unstable/amd64+gfx1034/h/hipsparse/5003/log.gz
