The removed "function that never worked correctly" that I was referring to was 
ncclCommInitRankMulti.

Canonical started from my draft rccl 6.4 update to package the ROCm 7.1.0 
version of rccl that is currently included in Ubuntu Resolute.

I have not yet reviewed what they did (otherwise, I would probably have already 
ported it back to Debian Experimental with whatever changes I felt necessary). 
I suspect rccl 7.1.0 may be able to build and run on the HIP Runtime 6.4 on 
Unstable, which would make that slightly easier.

You might be able to loosen the B-D from the Ubuntu rccl package and have it 
build and run successfully on Unstable. Most of the breaking changes in the HIP 
Runtime from 6.4 to 7.1 were things becoming stricter, so libraries written 
against the newer runtime might still be source compatible with the older 
runtime (but vice versa is unlikely, at least for big, complex libraries).

The rccl 7.1.0 + rocm-hipamd 6.4.4 combination won't have been tested by 
upstream. One risk is that there were significant changes to hipGetLastError 
behavior between ROCm 6 and ROCm 7, so it's possible that version mix could 
cause some subtle bugs at runtime. I think it's probably fine, though. If the 
unit tests pass, that would be a good sign.

In any case, I hope this information is helpful.

Sincerely,
Cory Bloor
