I see the problem. The openib btl does not properly handle the following call sequence (this is an openib btl bug IMHO):

btl_sendi (..., &descriptor);
btl_free (..., descriptor);

The bug is in the message coalescing code and it looks like extra logic
needs to be added to the openib btl's btl_free function for this to work
properly. I am working on a fix now.

-Nathan

On Mon, Nov 03, 2014 at 04:26:10PM +0200, Alina Sklarevich wrote:
> Hi,
>
> On 1.8.4rc1 we observe the following assert in the osu_mbw_mr test when
> using the openib BTL.
>
> When compiled in production mode (i.e. no --enable-debug) the test simply
> hangs.
>
> When using either the tcp BTL or the cm PML, the benchmark completes
> without error.
>
> The command line to reproduce this is:
>
> $ mpirun --bind-to core -display-map -mca btl_openib_if_include mlx5_0:1
> -np 2 -mca pml ob1 -mca btl openib,self,sm ./osu_mbw_mr
>
> # OSU MPI Multiple Bandwidth / Message Rate Test v4.4
> # [ pairs: 1 ] [ window size: 64 ]
> # Size MB/s Messages/s
> osu_mbw_mr: ../../../../opal/class/opal_list.h:547: _opal_list_append:
> Assertion `0 == item->opal_list_item_refcount' failed.
> [vegas15:30395] *** Process received signal ***
> [vegas15:30395] Signal: Aborted (6)
> [vegas15:30395] Signal code: (-6)
> [vegas15:30395] [ 0] /lib64/libpthread.so.0[0x30bc40f500]
> [vegas15:30395] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x30bc0328a5]
> [vegas15:30395] [ 2] /lib64/libc.so.6(abort+0x175)[0x30bc034085]
> [vegas15:30395] [ 3] /lib64/libc.so.6[0x30bc02ba1e]
> [vegas15:30395] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x30bc02bae0]
> [vegas15:30395] [ 5] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_btl_openib.so(+0x9087)[0x7ffff3f70087]
> [vegas15:30395] [ 6] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x403)[0x7ffff3f754b3]
> [vegas15:30395] [ 7] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0xf9e)[0x7ffff3f785b4]
> [vegas15:30395] [ 8] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_pml_ob1.so(+0xed08)[0x7ffff3308d08]
> [vegas15:30395] [ 9] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_pml_ob1.so(+0xf8ba)[0x7ffff33098ba]
> [vegas15:30395] [10] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x108)[0x7ffff3309a1f]
> [vegas15:30395] [11] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/libmpi.so.1(MPI_Isend+0x2ec)[0x7ffff7cff5e8]
> [vegas15:30395] [12] /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4/osu_mbw_mr[0x400fa4]
> [vegas15:30395] [13] /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4/osu_mbw_mr[0x40167d]
> [vegas15:30395] [14] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30bc01ecdd]
> [vegas15:30395] [15] /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4/osu_mbw_mr[0x400db9]
> [vegas15:30395] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 30395 on node vegas15 exited
> on signal 6 (Aborted).
> --------------------------------------------------------------------------
>
> Thanks,
> Alina.
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/11/16142.php
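
For readers unfamiliar with the assertion in the backtrace: it fires in _opal_list_append and requires that the item being appended is not currently counted as a member of a list. The toy model below is only a sketch of that kind of check, not Open MPI code. Every name in it (toy_item_t, toy_list_t, on_list_count, and the functions) is hypothetical, and it returns false where a debug build would assert. Whether the coalescing path actually appends an item twice is an assumption made for illustration; the thread does not say so.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a list item; NOT the real opal_list_item_t. */
typedef struct toy_item {
    struct toy_item *next;
    int on_list_count;          /* models a debug-only membership counter */
} toy_item_t;

/* Hypothetical singly linked list with a tail pointer. */
typedef struct toy_list {
    toy_item_t *head;
    toy_item_t *tail;
    size_t length;
} toy_list_t;

static void toy_list_init(toy_list_t *list)
{
    list->head = NULL;
    list->tail = NULL;
    list->length = 0;
}

/* Append an item. Returns false in the case where a debug build would
 * assert: the item is still counted as being on a list. */
static bool toy_list_append(toy_list_t *list, toy_item_t *item)
{
    if (0 != item->on_list_count) {
        return false;
    }
    item->next = NULL;
    if (NULL != list->tail) {
        list->tail->next = item;
    } else {
        list->head = item;
    }
    list->tail = item;
    item->on_list_count++;
    list->length++;
    return true;
}

/* Remove and return the first item, or NULL if the list is empty. */
static toy_item_t *toy_list_remove_first(toy_list_t *list)
{
    toy_item_t *item = list->head;
    if (NULL == item) {
        return NULL;
    }
    list->head = item->next;
    if (NULL == list->head) {
        list->tail = NULL;
    }
    item->next = NULL;
    item->on_list_count--;
    list->length--;
    return item;
}

/* Returns true if appending the same item twice is rejected, and the
 * item is accepted again once it has been properly removed. */
static bool toy_double_append_is_rejected(void)
{
    toy_list_t list;
    toy_item_t frag = { NULL, 0 };

    toy_list_init(&list);
    bool first = toy_list_append(&list, &frag);
    assert(first);                                /* first append is legal */
    bool second = toy_list_append(&list, &frag);  /* still on the list */

    toy_item_t *removed = toy_list_remove_first(&list);
    bool again = (removed == &frag) && toy_list_append(&list, &frag);

    return !second && again;
}
```

Calling toy_double_append_is_rejected() from a small main is enough to try this out. The point is only that a membership counter catches an item reaching a list while it is still on one, which is the class of error the assertion above guards against.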