r31904 should fix this issue. Please test it thoroughly and report any issues.

  George.


On Fri, May 9, 2014 at 6:56 AM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> I opened #4610 (https://svn.open-mpi.org/trac/ompi/ticket/4610)
> and attached a patch for the v1.8 branch.
>
> I ran several tests from the intel_tests test suite and did not observe
> any regressions.
>
> Please note there are still issues when running with --mca btl
> scif,vader,self.
>
> This might be another issue; I will investigate further next week.
>
> Gilles
>
> On 2014/05/09 18:08, Gilles Gouaillardet wrote:
>> I investigated some more with --mca btl scif,self.
>>
>> I found that the previous patch I posted was complete crap, and I
>> apologize for it.
>>
>> On a brighter note, and IMHO, the issue only occurs if fragments are
>> received (and then processed) out of order.
>> /* I did not observe this with the tcp btl, but I always see it with
>> the scif btl; I guess it can be observed with openib+RDMA as well. */
>>
>> Only in this case is opal_convertor_generic_simple_position(...)
>> invoked, and it does not set pConvertor->pStack
>> as expected by r31496.
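>>
>> To illustrate, here is a minimal, self-contained C sketch (a toy model,
>> NOT the real OPAL convertor API; names such as toy_convertor and
>> toy_set_position are hypothetical) of why out-of-order delivery takes a
>> different code path: in-order fragments only ever advance the cursor,
>> while an out-of-order fragment forces an arbitrary repositioning, which
>> is where the stack bookkeeping can go stale:
>>
>>     /* toy model of a pack/unpack cursor; the real code keeps a stack
>>      * (pConvertor->pStack) describing its place in the datatype */
>>     #include <stdio.h>
>>     #include <stddef.h>
>>
>>     typedef struct {
>>         size_t bytes_converted; /* how far into the message we are */
>>         size_t stack_state;     /* toy stand-in for pConvertor->pStack */
>>     } toy_convertor;
>>
>>     /* in-order path: cheap, only moves the cursor forward */
>>     static void toy_advance(toy_convertor *cv, size_t len) {
>>         cv->bytes_converted += len;
>>         cv->stack_state = cv->bytes_converted; /* stays consistent */
>>     }
>>
>>     /* out-of-order path: jump to an arbitrary offset, the role played
>>      * by opal_convertor_generic_simple_position(); a bug here leaves
>>      * stack_state out of sync with bytes_converted */
>>     static void toy_set_position(toy_convertor *cv, size_t offset) {
>>         cv->bytes_converted = offset;
>>         cv->stack_state = offset; /* must be updated together */
>>     }
>>
>>     int main(void) {
>>         toy_convertor cv = {0, 0};
>>         toy_advance(&cv, 23);      /* fragment at offset 0 */
>>         toy_advance(&cv, 23);      /* fragment at offset 23 */
>>         toy_set_position(&cv, 69); /* fragment at offset 69 arrives
>>                                       before the one at offset 46 */
>>         printf("cursor=%zu stack=%zu\n",
>>                cv.bytes_converted, cv.stack_state);
>>         return 0;
>>     }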
>>
>> I will run some more tests now.
>>
>> Gilles
>>
>> On 2014/05/08 2:23, George Bosilca wrote:
>>> Strange. The outcome and the timing of this issue seem to point to a link
>>> with the other datatype-related issue you reported earlier and, as
>>> suggested by Ralph, with Gilles's scif+vader issue.
>>>
>>> Generally speaking, the mechanism used to split the data across
>>> multiple BTLs is identical to the one used to split the data into
>>> fragments. So, if the culprit is in the splitting logic, one should
>>> see corruption as soon as we force exclusive use of the send protocol
>>> with an unconventional fragment size.
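>>>
>>> As a hedged sketch of that shared splitting logic (hypothetical names,
>>> not the actual Open MPI source): one routine carves a send into
>>> per-BTL chunks, and the same arithmetic carves each chunk into
>>> fragments, so the receiver must unpack every piece at exactly the
>>> offset the sender computed. With an odd fragment size such as the
>>> 23 bytes used below, the last piece comes up short, which is exactly
>>> the kind of corner the splitting logic must get right:
>>>
>>>     #include <stdio.h>
>>>     #include <stddef.h>
>>>
>>>     /* split `total` bytes into pieces of at most `max_piece` bytes;
>>>      * each piece carries the offset where it must be unpacked */
>>>     static void split(size_t total, size_t max_piece) {
>>>         for (size_t off = 0; off < total; off += max_piece) {
>>>             size_t len = total - off < max_piece ? total - off
>>>                                                  : max_piece;
>>>             printf("piece at offset %zu, length %zu\n", off, len);
>>>         }
>>>     }
>>>
>>>     int main(void) {
>>>         split(100, 23); /* pieces at 0/23/46/69/92; 8-byte tail */
>>>         return 0;
>>>     }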
>>>
>>> In other words, using the flags "--mca btl tcp,self --mca
>>> btl_tcp_flags 3 --mca btl_tcp_rndv_eager_limit 23 --mca
>>> btl_tcp_eager_limit 23 --mca btl_tcp_max_send_size 23" should always
>>> transfer wrong data, even when only a single BTL is in play.
>>>
>>>   George.
>>>
>>> On May 7, 2014, at 13:11, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>>>
>>>> OK.  So, I investigated a little more.  I only see the issue when I am 
>>>> running with multiple ports enabled such that I have two openib BTLs 
>>>> instantiated.  In addition, large message RDMA has to be enabled.  If 
>>>> those conditions are not met, then I do not see the problem.  For example:
>>>> FAILS:
>>>>   mpirun -np 2 -host host1,host2 --mca btl_openib_if_include
>>>> mlx5_0:1,mlx5_0:2 --mca btl_openib_flags 3 MPI_Isend_ator_c
>>>> PASS:
>>>>   mpirun -np 2 -host host1,host2 --mca btl_openib_if_include mlx5_0:1 --mca
>>>> btl_openib_flags 3 MPI_Isend_ator_c
>>>>   mpirun -np 2 -host host1,host2 --mca
>>>> btl_openib_if_include mlx5_0:1,mlx5_0:2 --mca btl_openib_flags 1
>>>> MPI_Isend_ator_c
>>>>
>>>> So we must have an issue in how we break up the message between
>>>> the two openib BTLs.  Maybe someone else can confirm my observations?
>>>> I was testing against the latest trunk.
>>>>
>
