Hi,

It looks like this fix resolved our problems as well.

Thanks,
Elena


On Fri, May 30, 2014 at 4:58 PM, Rolf vandeVaart <rvandeva...@nvidia.com>
wrote:

> This fixed all of my issues.  Thanks.  I will add that comment to the
> ticket as well.
>
> >-----Original Message-----
> >From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of George
> >Bosilca
> >Sent: Thursday, May 29, 2014 5:58 PM
> >To: Open MPI Developers
> >Subject: Re: [OMPI devel] regression with derived datatypes
> >
> >r31904 should fix this issue. Please test it thoroughly and report all issues.
> >
> >  George.
> >
> >
> >On Fri, May 9, 2014 at 6:56 AM, Gilles Gouaillardet
> ><gilles.gouaillar...@iferc.org> wrote:
> >> I opened #4610 (https://svn.open-mpi.org/trac/ompi/ticket/4610)
> >> and attached a patch for the v1.8 branch.
> >>
> >> I ran several tests from the intel_tests test suite and did not
> >> observe any regression.
> >>
> >> Please note there are still issues when running with --mca btl
> >> scif,vader,self.
> >>
> >> This might be another issue; I will investigate more next week.
> >>
> >> Gilles
> >>
> >> On 2014/05/09 18:08, Gilles Gouaillardet wrote:
> >>> I ran some more investigations with --mca btl scif,self
> >>>
> >>> I found that the previous patch I posted was completely broken, and
> >>> I apologize for it.
> >>>
> >>> On a brighter note, and IMHO, the issue only occurs if fragments are
> >>> received (and then processed) out of order.
> >>> /* I did not observe this with the tcp btl, but I always see it with
> >>> the scif btl; I guess it can also be observed with openib+RDMA */
> >>>
> >>> In this case only, opal_convertor_generic_simple_position(...) is
> >>> invoked and does not set pConvertor->pStack as expected by r31496.
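> >>>
> >>> Just to illustrate what "positioning" means here with a toy model (the
> >>> names and the plain strided layout below are made up; this is not the
> >>> real convertor code): repositioning amounts to mapping a byte offset in
> >>> the packed stream back to a place in the non-contiguous layout, which
> >>> is the state the stack has to describe before an out-of-order fragment
> >>> can be unpacked.
> >>>
> >>> /* Toy model only (made-up names, not the OMPI convertor): for a
> >>>  * vector of blocks of `blocklen` bytes separated by `stride` bytes,
> >>>  * map a byte position in the packed stream back to the block index,
> >>>  * the offset inside that block and the offset in the user buffer. */
> >>> #include <stdio.h>
> >>> #include <stddef.h>
> >>>
> >>> static void toy_position(size_t packed_pos, size_t blocklen,
> >>>                          size_t stride, size_t *block,
> >>>                          size_t *block_off, size_t *user_off)
> >>> {
> >>>     *block     = packed_pos / blocklen;
> >>>     *block_off = packed_pos % blocklen;
> >>>     *user_off  = *block * stride + *block_off;
> >>> }
> >>>
> >>> int main(void)
> >>> {
> >>>     /* e.g. a fragment starting at byte 23 of an 8-byte-block layout
> >>>      * with a 32-byte stride */
> >>>     size_t block, block_off, user_off;
> >>>     toy_position(23, 8, 32, &block, &block_off, &user_off);
> >>>     printf("block %zu, offset %zu, user offset %zu\n",
> >>>            block, block_off, user_off);
> >>>     return 0;
> >>> }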
> >>>
> >>> I will run some more tests now.
> >>>
> >>> Gilles
> >>>
> >>> On 2014/05/08 2:23, George Bosilca wrote:
> >>>> Strange. The outcome and the timing of this issue seem to highlight a
> >>>> link with the other datatype-related issue you reported earlier and,
> >>>> as suggested by Ralph, with Gilles' scif+vader issue.
> >>>>
> >>>> Generally speaking, the mechanism used to split the data in the case
> >>>> of multiple BTLs is identical to the one used to split the data into
> >>>> fragments. So, if the culprit is in the splitting logic, one might see
> >>>> some weirdness as soon as we force the exclusive use of the send
> >>>> protocol with an unconventional fragment size.
> >>>>
> >>>> In other words, using the following flags "--mca btl tcp,self --mca
> >>>> btl_tcp_flags 3 --mca btl_tcp_rndv_eager_limit 23 --mca
> >>>> btl_tcp_eager_limit 23 --mca btl_tcp_max_send_size 23" should always
> >>>> transfer wrong data, even when only a single BTL is in play.
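> >>>>
> >>>> If that is the case, a minimal program along these lines (just a
> >>>> sketch with arbitrary sizes and names, not the intel_tests
> >>>> MPI_Isend_ator_c test) launched on two ranks with the flags above
> >>>> should expose the corruption:
> >>>>
> >>>> /* Sketch only: send a non-contiguous derived datatype and verify
> >>>>  * every element on the receiver. The sizes below are arbitrary. */
> >>>> #include <mpi.h>
> >>>> #include <stdio.h>
> >>>>
> >>>> #define COUNT    64   /* number of blocks in the vector        */
> >>>> #define BLOCKLEN  3   /* doubles per block                     */
> >>>> #define STRIDE    5   /* doubles between the starts of blocks  */
> >>>>
> >>>> int main(int argc, char **argv)
> >>>> {
> >>>>     double buf[COUNT * STRIDE];
> >>>>     MPI_Datatype vec;
> >>>>     int rank, i, errors = 0;
> >>>>
> >>>>     MPI_Init(&argc, &argv);
> >>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>>>
> >>>>     MPI_Type_vector(COUNT, BLOCKLEN, STRIDE, MPI_DOUBLE, &vec);
> >>>>     MPI_Type_commit(&vec);
> >>>>
> >>>>     /* sender fills the whole buffer, receiver pre-fills with -1 */
> >>>>     for (i = 0; i < COUNT * STRIDE; i++)
> >>>>         buf[i] = (0 == rank) ? (double)i : -1.0;
> >>>>
> >>>>     if (0 == rank) {
> >>>>         MPI_Send(buf, 1, vec, 1, 0, MPI_COMM_WORLD);
> >>>>     } else if (1 == rank) {
> >>>>         MPI_Recv(buf, 1, vec, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> >>>>         /* covered elements must match the sender, gaps must stay -1 */
> >>>>         for (i = 0; i < COUNT * STRIDE; i++) {
> >>>>             double expected =
> >>>>                 ((i % STRIDE) < BLOCKLEN) ? (double)i : -1.0;
> >>>>             if (buf[i] != expected) errors++;
> >>>>         }
> >>>>         printf("%d corrupted elements\n", errors);
> >>>>     }
> >>>>
> >>>>     MPI_Type_free(&vec);
> >>>>     MPI_Finalize();
> >>>>     return 0;
> >>>> }
> >>>>
> >>>> Compile with mpicc and launch with mpirun -np 2 plus the flags above;
> >>>> any nonzero count indicates corrupted data.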
> >>>>
> >>>>   George.
> >>>>
> >>>> On May 7, 2014, at 13:11, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> >>>>
> >>>>> OK.  So, I investigated a little more.  I only see the issue when I
> >>>>> am running with multiple ports enabled, such that I have two openib
> >>>>> BTLs instantiated.  In addition, large message RDMA has to be
> >>>>> enabled.  If those conditions are not met, then I do not see the
> >>>>> problem.  For example:
> >>>>>
> >>>>> FAILS:
> >>>>>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 3 MPI_Isend_ator_c
> >>>>>
> >>>>> PASS:
> >>>>>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1 -mca btl_openib_flags 3 MPI_Isend_ator_c
> >>>>>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 1 MPI_Isend_ator_c
> >>>>>
> >>>>> So we must have some type of issue when we break up the message
> >>>>> between the two openib BTLs.  Maybe someone else can confirm my
> >>>>> observations?
> >>>>> I was testing against the latest trunk.
> >>>>>
> >>
