I ran some more investigations with --mca btl scif,self

I found that the previous patch I posted was completely wrong, and I
apologize for it.

On a brighter side, imho, the issue only occurs if fragments are
received (and then processed) out of order.
/* I did not observe this with the tcp btl, but I always see it with
the scif btl; I guess it could also be observed
with openib+RDMA */

In that case only, opal_convertor_generic_simple_position(...) is
invoked, and it does not set pConvertor->pStack
as expected by r31496.

I will run some more tests now.

Gilles

On 2014/05/08 2:23, George Bosilca wrote:
> Strange. The outcome and the timing of this issue seem to suggest a link 
> with the other datatype-related issue you reported earlier and, as suggested 
> by Ralph, with Gilles' scif+vader issue.
>
> Generally speaking, the mechanism used to split the data across multiple 
> BTLs is identical to the one used to split the data into fragments. So, if 
> the culprit is in the splitting logic, one might see some weirdness as soon 
> as we force the exclusive usage of the send protocol with an unconventional 
> fragment size.
>
> In other words, using the following flags "--mca btl tcp,self --mca 
> btl_tcp_flags 3 --mca btl_tcp_rndv_eager_limit 23 --mca btl_tcp_eager_limit 23 
> --mca btl_tcp_max_send_size 23" should always transfer wrong data, even when 
> only one single BTL is in play.
>
>   George.
>
> On May 7, 2014, at 13:11 , Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>
>> OK.  So, I investigated a little more.  I only see the issue when I am 
>> running with multiple ports enabled such that I have two openib BTLs 
>> instantiated.  In addition, large message RDMA has to be enabled.  If those 
>> conditions are not met, then I do not see the problem.  For example:
>> FAILS:
>>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include 
>> mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 3 MPI_Isend_ator_c
>> PASS:
>>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1 -mca 
>> btl_openib_flags 3 MPI_Isend_ator_c
>>   mpirun -np 2 -host host1,host2 -mca 
>> btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 1 
>> MPI_Isend_ator_c
>>  
>> So we must have some type of issue when we break up the message between the 
>> two openib BTLs.  Maybe someone else can confirm my observations?
>> I was testing against the latest trunk.
>>
