I ran some more investigations with --mca btl scif,self. I found that the previous patch I posted was complete crap, and I apologize for it.
On the bright side, and IMHO, the issue only occurs if fragments are received (and then processed) out of order. (I did not observe this with the tcp BTL, but I always see it with the scif BTL; I guess it could be observed with openib+RDMA too.) Only in this case is opal_convertor_generic_simple_position(...) invoked, and it does not set pConvertor->pStack as expected by r31496.

I will run some more tests from now on.

Gilles

On 2014/05/08 2:23, George Bosilca wrote:
> Strange. The outcome and the timing of this issue seem to highlight a link
> with the other datatype-related issue you reported earlier, and, as suggested
> by Ralph, with Gilles' scif+vader issue.
>
> Generally speaking, the mechanism used to split the data in the case of
> multiple BTLs is identical to the one used to split the data into fragments.
> So, if the culprit is in the splitting logic, one might see some weirdness as
> soon as we force the exclusive usage of the send protocol with an
> unconventional fragment size.
>
> In other words, using the following flags "--mca btl tcp,self --mca
> btl_tcp_flags 3 --mca btl_tcp_rndv_eager_limit 23 --mca btl_tcp_eager_limit 23
> --mca btl_tcp_max_send_size 23" should always transfer wrong data, even when
> only a single BTL is in play.
>
> George.
>
> On May 7, 2014, at 13:11, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>
>> OK. So, I investigated a little more. I only see the issue when I am
>> running with multiple ports enabled such that I have two openib BTLs
>> instantiated. In addition, large message RDMA has to be enabled. If those
>> conditions are not met, then I do not see the problem.
>> For example:
>>
>> FAILS:
>>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 3 MPI_Isend_ator_c
>>
>> PASS:
>>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1 -mca btl_openib_flags 3 MPI_Isend_ator_c
>>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 1 MPI_Isend_ator_c
>>
>> So we must have some kind of issue when we break up the message between the
>> two openib BTLs. Maybe someone else can confirm my observations?
>> I was testing against the latest trunk.