Re: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, basesmuma, p2p

2022-11-07 Thread Ben Menadue via devel
Hi,

We see this on our cluster as well. We traced it to Python loading shared 
library extensions with RTLD_LOCAL.

The Python module (mpi4py?) has a dependency on libmpi.so, which in turn has a 
dependency on libhcoll.so. Because the Python module is loaded with 
RTLD_LOCAL, anything that it pulls in with it also ends up being loaded that 
way. Later, hcoll tries to load its own plugin .so files, but since 
libhcoll.so was loaded with RTLD_LOCAL, its symbols are not in the global 
namespace and the plugin libraries can’t resolve them.
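
To make the namespace point concrete, here is a minimal sketch that inspects 
the dlopen flags CPython uses when it loads extension modules. 
sys.getdlopenflags() and the os.RTLD_* constants are standard library; the 
helper name describe_dlopen_flags is made up for illustration, and the sketch 
assumes a Unix-like platform.

```python
import os
import sys


def describe_dlopen_flags(flags):
    """Report whether a dlopen() flag word exports symbols globally.

    The check tests the RTLD_GLOBAL bit rather than RTLD_LOCAL, because
    on Linux RTLD_LOCAL has the value 0 and simply means "not global".
    """
    is_global = bool(flags & os.RTLD_GLOBAL)
    return {"global": is_global, "local": not is_global}


if __name__ == "__main__":
    # Flags the interpreter will pass to dlopen() for extension modules.
    print(describe_dlopen_flags(sys.getdlopenflags()))
```

On a stock CPython on Linux this typically reports "local", which is 
consistent with the behaviour described above.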

It might be fixable by having the hcoll plugins linked against libhcoll.so, but 
since it’s just a pre-built bundle from Mellanox it’s not something I can test 
easily.

Otherwise, the workaround we use is to set LD_PRELOAD=libmpi.so when launching 
Python, so that libmpi.so gets loaded into the global namespace, as would 
happen with a “normal” compiled program.
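
As a rough sketch of that workaround, the helper below builds an mpirun 
command line that sets LD_PRELOAD for the Python processes via the env 
utility. The function name and defaults are hypothetical, and it assumes 
libmpi.so can be found by bare name on the library search path; a full path 
may be needed on some systems.

```python
import shlex


def preload_command(script, nprocs=2, library="libmpi.so"):
    """Build an mpirun argv that starts Python with `library` preloaded.

    env(1) sets LD_PRELOAD only for the Python processes it launches,
    so mpirun itself is not affected.
    """
    return [
        "mpirun", "-n", str(nprocs),
        "env", f"LD_PRELOAD={library}",
        "python", script,
    ]


if __name__ == "__main__":
    # Print the command line instead of running it.
    print(shlex.join(preload_command("test.py")))
```

The returned list can be passed to subprocess.run() unchanged, or printed and 
pasted into a job script.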

Cheers,
Ben



> On 8 Nov 2022, at 1:48 am, Tomislav Janjusic via devel wrote:
> 
> Ugh - runtime command is literally in the e-mail.
>  
> Sorry about that.
>  
>  
> --
> Tomislav Janjusic
> Staff Eng., Mellanox, HPC SW
> +1 (512) 598-0386
> NVIDIA <http://www.nvidia.com/>
>  
> From: Tomislav Janjusic 
> Sent: Monday, November 7, 2022 8:48 AM
> To: 'Open MPI Developers'; Open MPI Users
> Cc: mrlong
> Subject: RE: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available 
> but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, 
> basesmuma, p2p
>  
> What is the runtime command?
> It’s coming from HCOLL. If HCOLL is not needed feel free to disable it -mca 
> coll ^hcoll
>  
> Tomislav Janjusic
> Staff Eng., Mellanox, HPC SW
> +1 (512) 598-0386
> NVIDIA <http://www.nvidia.com/>
>  
> From: devel On Behalf Of mrlong via devel
> Sent: Monday, November 7, 2022 2:33 AM
> To: devel@lists.open-mpi.org; Open MPI Users
> Cc: mrlong
> Subject: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but 
> requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, 
> basesmuma, p2p
>  
> The execution of openmpi 5.0.0rc9 results in the following:
> 
> (py3.9) [user@machine01 share]$  mpirun -n 2 python test.py
> [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: 
> basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
> [LOG_CAT_ML] ml_discover_hierarchy exited with error
> [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: 
> basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
> [LOG_CAT_ML] ml_discover_hierarchy exited with error
> 
> Why is this message printed?
> 



Re: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, basesmuma, p2p

2022-11-07 Thread Tomislav Janjusic via devel
Ugh - runtime command is literally in the e-mail.

Sorry about that.


--
Tomislav Janjusic
Staff Eng., Mellanox, HPC SW
+1 (512) 598-0386
NVIDIA<http://www.nvidia.com/>


Re: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, basesmuma, p2p

2022-11-07 Thread Tomislav Janjusic via devel
What is the runtime command?
It’s coming from HCOLL. If HCOLL is not needed, feel free to disable it with 
-mca coll ^hcoll
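
For a Python job, the same MCA setting can also be supplied through the 
environment, since Open MPI reads parameters from OMPI_MCA_<name> variables. 
A minimal sketch, with a hypothetical helper name:

```python
import os


def without_hcoll(env):
    """Return a copy of `env` with Open MPI's hcoll component excluded.

    Overwrites any existing OMPI_MCA_coll value; the input mapping is
    left untouched.
    """
    new_env = dict(env)
    new_env["OMPI_MCA_coll"] = "^hcoll"
    return new_env


if __name__ == "__main__":
    env = without_hcoll(os.environ)
    print(env["OMPI_MCA_coll"])
```

The resulting mapping can be passed as env= to subprocess.run() when 
launching mpirun from a wrapper script.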

Tomislav Janjusic
Staff Eng., Mellanox, HPC SW
+1 (512) 598-0386
NVIDIA

From: devel On Behalf Of mrlong via devel
Sent: Monday, November 7, 2022 2:33 AM
To: devel@lists.open-mpi.org; Open MPI Users
Cc: mrlong
Subject: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but 
requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, basesmuma, 
p2p


The execution of openmpi 5.0.0rc9 results in the following:

(py3.9) [user@machine01 share]$  mpirun -n 2 python test.py
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: 
basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error
[LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: 
basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
[LOG_CAT_ML] ml_discover_hierarchy exited with error

Why is this message printed?