I see the problem. The openib BTL does not properly handle the following
call sequence (IMHO this is an openib BTL bug):

btl_sendi (..., &descriptor);
btl_free (..., descriptor);

The bug is in the message coalescing code, and it looks like extra logic
needs to be added to the openib BTL's btl_free function for this sequence
to work properly. I am working on a fix now.
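
To illustrate the kind of failure described above, here is a minimal toy
model in C. It is only a sketch under stated assumptions: every type and
function name in it (frag_t, toy_alloc, toy_free_buggy, toy_free_fixed,
run_sequence) is hypothetical, and it is neither the actual openib BTL code
nor the eventual fix. It assumes that an allocated fragment is attached to a
list of coalesced fragments, and that a free routine which returns the
fragment to the pool without detaching it leaves the "still on a list" marker
set, so the next append trips the check that the debug build asserts on.

```c
/* Toy model only: hypothetical names, NOT the Open MPI / openib BTL API. */
#include <stddef.h>

typedef struct frag {
    struct frag *list_next;  /* link used while on the coalesced list */
    struct frag *free_next;  /* link used while on the free stack */
    int on_list;             /* stands in for opal_list_item_refcount */
} frag_t;

typedef struct { frag_t *head; } coalesced_list_t;
typedef struct { frag_t *top; } free_stack_t;

static void stack_push(free_stack_t *s, frag_t *f)
{
    f->free_next = s->top;
    s->top = f;
}

static frag_t *stack_pop(free_stack_t *s)
{
    frag_t *f = s->top;
    if (f != NULL) {
        s->top = f->free_next;
        f->free_next = NULL;
    }
    return f;
}

/* Attach a fragment; returns -1 where a debug build would assert. */
static int coalesced_attach(coalesced_list_t *l, frag_t *f)
{
    if (f->on_list != 0) {
        return -1;           /* fragment is still linked somewhere */
    }
    f->list_next = l->head;
    l->head = f;
    f->on_list = 1;
    return 0;
}

/* Unlink a fragment from the coalesced list and clear its marker. */
static void coalesced_detach(coalesced_list_t *l, frag_t *f)
{
    for (frag_t **pp = &l->head; *pp != NULL; pp = &(*pp)->list_next) {
        if (*pp == f) {
            *pp = f->list_next;
            f->list_next = NULL;
            f->on_list = 0;
            return;
        }
    }
}

/* Take a fragment from the pool and attach it to the coalesced list. */
static frag_t *toy_alloc(free_stack_t *s, coalesced_list_t *l)
{
    frag_t *f = stack_pop(s);
    if (f == NULL || coalesced_attach(l, f) != 0) {
        return NULL;
    }
    return f;
}

/* Buggy free: returns the fragment to the pool while it is still attached. */
static void toy_free_buggy(free_stack_t *s, frag_t *f)
{
    stack_push(s, f);
}

/* Free with the extra logic: detach first, then return to the pool. */
static void toy_free_fixed(free_stack_t *s, coalesced_list_t *l, frag_t *f)
{
    if (f->on_list != 0) {
        coalesced_detach(l, f);
    }
    stack_push(s, f);
}

/* Models "alloc, hand descriptor back, free, alloc again".
 * Returns 1 if the second allocation succeeds, 0 if it trips the check. */
static int run_sequence(int use_fixed_free)
{
    frag_t frag = {0};
    free_stack_t pool = { NULL };
    coalesced_list_t pending = { NULL };

    stack_push(&pool, &frag);
    frag_t *d = toy_alloc(&pool, &pending);
    if (d == NULL) {
        return 0;
    }
    if (use_fixed_free) {
        toy_free_fixed(&pool, &pending, d);
    } else {
        toy_free_buggy(&pool, d);
    }
    return toy_alloc(&pool, &pending) != NULL;
}
```

In this model, run_sequence(0) fails on the second allocation because the
fragment comes back from the pool still marked as being on a list, while
run_sequence(1) succeeds because the free routine detaches it first.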

-Nathan

On Mon, Nov 03, 2014 at 04:26:10PM +0200, Alina Sklarevich wrote:
>    Hi,
>    On 1.8.4rc1 we observe the following assertion failure in the
>    osu_mbw_mr test when using the openib BTL.
>    When compiled in production mode (i.e. no --enable-debug) the test simply
>    hangs.
>    When using either the tcp BTL or the cm PML, the benchmark completes
>    without error.
>    The command line to reproduce this is:
>    $ mpirun --bind-to core -display-map -mca btl_openib_if_include mlx5_0:1
>    -np 2 -mca pml ob1 -mca btl openib,self,sm ./osu_mbw_mr
>    # OSU MPI Multiple Bandwidth / Message Rate Test v4.4
>    # [ pairs: 1 ] [ window size: 64 ]
>    # Size                  MB/s        Messages/s
>    osu_mbw_mr: ../../../../opal/class/opal_list.h:547: _opal_list_append:
>    Assertion `0 == item->opal_list_item_refcount' failed.
>    [vegas15:30395] *** Process received signal ***
>    [vegas15:30395] Signal: Aborted (6)
>    [vegas15:30395] Signal code:  (-6)
>    [vegas15:30395] [ 0] /lib64/libpthread.so.0[0x30bc40f500]
>    [vegas15:30395] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x30bc0328a5]
>    [vegas15:30395] [ 2] /lib64/libc.so.6(abort+0x175)[0x30bc034085]
>    [vegas15:30395] [ 3] /lib64/libc.so.6[0x30bc02ba1e]
>    [vegas15:30395] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x30bc02bae0]
>    [vegas15:30395] [ 5] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_btl_openib.so(+0x9087)[0x7ffff3f70087]
>    [vegas15:30395] [ 6] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x403)[0x7ffff3f754b3]
>    [vegas15:30395] [ 7] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0xf9e)[0x7ffff3f785b4]
>    [vegas15:30395] [ 8] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_pml_ob1.so(+0xed08)[0x7ffff3308d08]
>    [vegas15:30395] [ 9] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_pml_ob1.so(+0xf8ba)[0x7ffff33098ba]
>    [vegas15:30395] [10] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x108)[0x7ffff3309a1f]
>    [vegas15:30395] [11] /labhome/alinas/workspace/tt/ompi_rc1/openmpi-1.8.4rc1/install/lib/libmpi.so.1(MPI_Isend+0x2ec)[0x7ffff7cff5e8]
>    [vegas15:30395] [12] /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4/osu_mbw_mr[0x400fa4]
>    [vegas15:30395] [13] /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4/osu_mbw_mr[0x40167d]
>    [vegas15:30395] [14] /lib64/libc.so.6(__libc_start_main+0xfd)[0x30bc01ecdd]
>    [vegas15:30395] [15] /hpc/local/benchmarks/hpc-stack-gcc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4/osu_mbw_mr[0x400db9]
>    [vegas15:30395] *** End of error message ***
>    --------------------------------------------------------------------------
>    mpirun noticed that process rank 0 with PID 30395 on node vegas15 exited
>    on signal 6 (Aborted).
>    --------------------------------------------------------------------------
>    Thanks,
>    Alina.

> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/11/16142.php
