Hello, Jeff.

Please check the attached tar ("auto-failure" dir). There I saw the following message:

--------------------------------------------------------------------------
An internal error has occurred in the Open MPI usNIC BTL. This is highly
unusual and shouldn't happen. It suggests that there may be something
wrong with the usNIC or OpenFabrics configuration on this server.

  Server:  cn5
  Message: usnic connectivity client IPC connect read failed
  File:    /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/usnic/btl_usnic_cclient.c
  Line:    125
  Error:   Operation not permitted
--------------------------------------------------------------------------

I was surprised, because as I said, we don't use Cisco hardware. My guess is that the problem may be in the query function. Either way, I think this shows that the usnic BTL somehow participates in the computation.

2014-06-01 19:20 GMT+07:00 Jeff Squyres (jsquyres) <jsquy...@cisco.com>:

> Just to be clear: it looks like you haven't seen any errors from the usnic
> BTL, right? (the Cisco VIC uses the usnic BTL only -- it does not use the
> openib BTL)
>
>
> On Jun 1, 2014, at 2:57 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>
> > Hello, while testing the new PMI implementation I ran into a problem with
> > OpenIB and/or usNIC support.
> > The cluster I use is built on Mellanox QDR. We don't use Cisco hardware,
> > thus no Cisco Virtual Interface Card. To rule out any influence from the
> > new PMI code I used mpirun to launch the job. The Slurm job script is
> > attached.
> >
> > While investigating the problem I found the following:
> > 1. With the TCP btl everything works without errors (add export
> > OMPI_MCA_btl="tcp,self" in the attached batch script).
> >
> > 2. With OpenIB support fixed (add export OMPI_MCA_btl="openib,self" in
> > the attached batch script) I get the following error:
> > hellompi: /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> > hellompi: /home/research/artpol/ompi_dev//ompi-trunk_r31907/ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> >
> > Complete logs are tar-ed, check the "openib-failure" directory.
> >
> > 3. If I do not fix the BTL component (no OMPI_MCA_btl is exported) I
> > get either an immediate failure mentioning usNIC/OpenIB problems OR the
> > program hangs.
> > For both cases I'm attaching complete tar-ed logs. Check the "auto-failure"
> > dir for ompi stdout and stderr and "auto-hang" for the hang case.
> >
> > I am ready to provide additional info or help with testing, but I have no
> > time to track down the problem myself in the next several days.
> >
> > --
> > С Уважением, Поляков Артем Юрьевич
> > Best regards, Artem Y. Polyakov
> > <task_mpirun.job><usnic-openib-faults.tar.bz2>_______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2014/06/14922.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/06/14926.php

--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov
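
P.S. For anyone trying to reproduce this, here is a minimal sketch of how the BTL selections described in the report could be switched from a batch script. The function names and case labels are hypothetical and not taken from the attached job script. The "no-usnic" case is an additional assumption: it relies on Open MPI's "^" exclusion syntax for MCA component lists to leave out only the usnic BTL, which was not one of the configurations tested above.

```shell
#!/bin/sh
# Hypothetical helper: print the OMPI_MCA_btl value for a named test case.
# The "tcp" and "openib" values are the ones quoted in the report; the
# case labels themselves are made up for this sketch.
btl_for_case() {
    case "$1" in
        tcp)      echo "tcp,self" ;;     # case 1: reported to work
        openib)   echo "openib,self" ;;  # case 2: assertion in del_procs
        no-usnic) echo "^usnic" ;;       # assumption: exclude only usnic
        auto)     echo "" ;;             # case 3: nothing exported
        *)        echo "unknown case: $1" >&2; return 1 ;;
    esac
}

# Export OMPI_MCA_btl for the chosen case, or unset it for "auto".
select_btl() {
    value=$(btl_for_case "$1") || return 1
    if [ -n "$value" ]; then
        export OMPI_MCA_btl="$value"
    else
        unset OMPI_MCA_btl
    fi
}
```

Calling "select_btl no-usnic" before the mpirun line in the batch script would be one way to check whether the failure and the hang go away once usnic is excluded while openib stays eligible.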