The removed "function that never worked correctly" that I was referring to was ncclCommInitRankMulti.
Canonical used my draft rccl 6.4 update and packaged the ROCm 7.1.0 version of rccl currently included in Ubuntu Resolute. I have not yet reviewed what they did (otherwise, I would probably have already ported it back to Debian Experimental with whatever changes I felt necessary).

I suspect rccl 7.1.0 may be able to build and run on the HIP Runtime 6.4 on Unstable, which would make that slightly easier. You might be able to loosen the B-D from the Ubuntu rccl package and have it build and run successfully on Unstable. Most of the breaking changes in the HIP Runtime from 6.4 to 7.1 were things becoming stricter, so libraries written against the newer runtime might still be source compatible with the older runtime (but vice versa is unlikely, at least for big, complex libraries).

The rccl 7.1.0 + rocm-hipamd 6.4.4 combination won't have been tested by upstream. One risk is that there were significant changes to hipGetLastError behavior between ROCm 6 and ROCm 7, so it's possible that version mix could cause some subtle bugs at runtime. I think it's probably fine, though. If the unit tests pass, that would be a good sign.

In any case, I hope this information is helpful.

Sincerely,
qCory Blooe

