TNT3530 opened a new issue, #16393: URL: https://github.com/apache/tvm/issues/16393
### Expected behavior

MLC-LLM should load the sharded model across all 4 GPUs and start inferring. The issue is confirmed only with the XGMI bridge enabled; adding `amdgpu.use_xgmi_p2p=0` to the GRUB config makes the issue stop with no other changes, though this reverts back to PCIe P2P only. Here is the output when attempting to run with `NCCL_DEBUG=INFO`: [screenlog.txt](https://github.com/mlc-ai/mlc-llm/files/13852656/screenlog.txt)

### Actual behavior

```
/src/extlibs/rccl/build/hipify/src/transport/p2p.cc:287 NCCL WARN Cuda failure 'invalid argument'
```

```
terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what():  [02:18:19] /workspace/tvm/src/runtime/disco/nccl/nccl.cc:196: rcclErrror: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Stack trace:
  0: _ZN3tvm7runtime6deta
  1: tvm::runtime::nccl::InitCCLPerWorker(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
  2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::AssignTypedLambda<void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>(void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  3: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker*, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
  4: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
  5: 0x00007ff61c0dc252
  6: start_thread
        at ./nptl/pthread_create.c:442
  7: 0x00007ff64cd2665f at
        ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
  8: 0xffffffffffffffff
```

### Environment

- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): ROCm 6.0
- Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
- Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): 4x AMD Instinct MI100
- How you installed MLC-LLM (conda, source): conda
- How you installed TVM-Unity (pip, source): pip
- Python version (e.g. 3.10): 3.10.12
- TVM Unity Hash Tag: [unity.txt](https://github.com/mlc-ai/mlc-llm/files/13852651/unity.txt)

### Steps to reproduce

Install MLC-LLM, then run the following Python code to start loading/inferring (imports added here; they assume the `mlc_chat` package layout current at the time of this report):

```python
from mlc_chat import ChatModule, ChatConfig
from mlc_chat.callback import StreamToStdout

cm = ChatModule(
    model="goliath-120b-q4f16_1",
    chat_config=ChatConfig(
        max_gen_len=4096,
        conv_template="LM",
        temperature=0.75,
        repetition_penalty=1.1,
        top_p=0.9,
        tensor_parallel_shards=4,
        context_window_size=4096,
    ),
)
output = cm.generate(
    prompt="What is the meaning of life?",
    progress_callback=StreamToStdout(callback_interval=2),
)
```

### Triage

I'm not sure where this technically falls under.
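For reference, the kernel-parameter workaround mentioned under *Expected behavior* can be applied as follows. This is a sketch assuming the stock Ubuntu GRUB layout; the file path and the pre-existing `quiet splash` flags are illustrative assumptions, not taken from this report:

```shell
# /etc/default/grub -- append the parameter to the kernel command line.
# Keep whatever flags are already present on the line; "quiet splash"
# below is just an example of typical defaults.
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amdgpu.use_xgmi_p2p=0"

# Then regenerate the GRUB config and reboot:
#   sudo update-grub
#   sudo reboot
#
# After reboot, confirm the parameter is active:
#   grep -o 'amdgpu.use_xgmi_p2p=0' /proc/cmdline
```

Note that, as described above, this disables XGMI peer-to-peer entirely, so inter-GPU traffic falls back to PCIe.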