TNT3530 opened a new issue, #16393:
URL: https://github.com/apache/tvm/issues/16393

   ### Expected behavior
   MLC-LLM should load the sharded model across all 4 GPUs and start inference.
   The issue is confirmed only with the XGMI bridge enabled: adding `amdgpu.use_xgmi_p2p=0` to the GRUB config makes the failure stop with no other changes, though this reverts to PCIe P2P only.
   
   Here is the output when attempting to run with `NCCL_DEBUG=INFO`
   
[screenlog.txt](https://github.com/mlc-ai/mlc-llm/files/13852656/screenlog.txt)
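   
   For reference, a minimal sketch of how such a log can be captured (an illustration, not the exact command used for the attached log): RCCL reads the `NCCL_*` variables, so they have to be set before the ChatModule and its Disco workers are created.
   
   ```
   import os
   
   # Must run before any GPU / communicator initialization,
   # i.e. before ChatModule is constructed.
   os.environ["NCCL_DEBUG"] = "INFO"
   ```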
   
   ### Actual behavior
   ```
   /src/extlibs/rccl/build/hipify/src/transport/p2p.cc:287 NCCL WARN Cuda 
failure 'invalid argument'
   ```
   ```
   terminate called after throwing an instance of 'tvm::runtime::InternalError'
     what():  [02:18:19] /workspace/tvm/src/runtime/disco/nccl/nccl.cc:196: 
rcclErrror: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
   Stack trace:
     0: _ZN3tvm7runtime6deta
     1: tvm::runtime::nccl::InitCCLPerWorker(tvm::runtime::ShapeTuple, 
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
     2: 
tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void
 (tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >)>::AssignTypedLambda<void 
(*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >)>(void 
(*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >), 
std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> 
>)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> 
>::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, 
tvm::runtime::TVMRetValue*)
     3: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker*, 
long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
     4: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
     5: 0x00007ff61c0dc252
     6: start_thread
           at ./nptl/pthread_create.c:442
     7: 0x00007ff64cd2665f
           at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
     8: 0xffffffffffffffff
     ```
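   
   Since the failure only shows up with XGMI P2P enabled, one way to check whether peer access itself is broken at the HIP level, independent of MLC/TVM, is to query peer access between every pair of GPUs. The sketch below assumes a ROCm build of PyTorch is available; it is not part of the original report:
   
   ```
   import torch
   
   # Query HIP peer-to-peer access between every pair of visible GPUs.
   n = torch.cuda.device_count()
   for i in range(n):
       for j in range(n):
           if i == j:
               continue
           ok = torch.cuda.can_device_access_peer(i, j)
           print(f"GPU {i} -> GPU {j}: peer access {'available' if ok else 'NOT available'}")
   ```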
    
   
   ### Environment
   Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): ROCm 6.0
   Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
   Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): 4x AMD Instinct MI100
   How you installed MLC-LLM (conda, source): conda
   How you installed TVM-Unity (pip, source): pip
   Python version (e.g. 3.10): 3.10.12
   TVM Unity Hash Tag: 
[unity.txt](https://github.com/mlc-ai/mlc-llm/files/13852651/unity.txt)
   
   ### Steps to reproduce
   Install MLC-LLM.
   Run the following Python code to load the model and start generating:
   ```
   # Imports assume the `mlc_chat` Python package layout used by MLC-LLM at the time of this report.
   from mlc_chat import ChatModule, ChatConfig
   from mlc_chat.callback import StreamToStdout
   
   # Load the sharded model across 4 GPUs via tensor parallelism.
   cm = ChatModule(model="goliath-120b-q4f16_1", chat_config=ChatConfig(
       max_gen_len=4096,
       conv_template="LM",
       temperature=0.75,
       repetition_penalty=1.1,
       top_p=0.9,
       tensor_parallel_shards=4,
       context_window_size=4096,
   ))
   
   output = cm.generate(
       prompt="What is the meaning of life?",
       progress_callback=StreamToStdout(callback_interval=2),
   )
   ```
   
   ### Triage
   I'm not sure which component this falls under.
   

