Hi,

It looks like this fix resolved our problems as well.
Thanks,
Elena

On Fri, May 30, 2014 at 4:58 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> This fixed all of my issues. Thanks. I will add that comment to the ticket
> also.
>
> >-----Original Message-----
> >From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of George Bosilca
> >Sent: Thursday, May 29, 2014 5:58 PM
> >To: Open MPI Developers
> >Subject: Re: [OMPI devel] regression with derived datatypes
> >
> >r31904 should fix this issue. Please test it thoroughly and report all issues.
> >
> >  George.
> >
> >On Fri, May 9, 2014 at 6:56 AM, Gilles Gouaillardet
> ><gilles.gouaillar...@iferc.org> wrote:
> >> I opened #4610 (https://svn.open-mpi.org/trac/ompi/ticket/4610)
> >> and attached a patch for the v1.8 branch.
> >>
> >> I ran several tests from the intel_tests test suite and did not
> >> observe any regression.
> >>
> >> Please note there are still issues when running with --mca btl
> >> scif,vader,self.
> >>
> >> This might be another issue; I will investigate more next week.
> >>
> >> Gilles
> >>
> >> On 2014/05/09 18:08, Gilles Gouaillardet wrote:
> >>> I ran some more investigations with --mca btl scif,self.
> >>>
> >>> I found that the previous patch I posted was complete crap and I
> >>> apologize for it.
> >>>
> >>> On a brighter note, and IMHO, the issue only occurs if fragments are
> >>> received (and then processed) out of order.
> >>> (I did not observe this with the tcp btl, but I always see it with the
> >>> scif btl; I guess it could be observed with openib+RDMA as well.)
> >>>
> >>> In this case only, opal_convertor_generic_simple_position(...) is
> >>> invoked and does not set pConvertor->pStack as expected by r31496.
> >>>
> >>> I will run some more tests now.
> >>>
> >>> Gilles
> >>>
> >>> On 2014/05/08 2:23, George Bosilca wrote:
> >>>> Strange. The outcome and the timing of this issue seem to highlight a
> >>>> link with the other datatype-related issue you reported earlier and, as
> >>>> suggested by Ralph, with Gilles' scif+vader issue.
> >>>>
> >>>> Generally speaking, the mechanism used to split the data across multiple
> >>>> BTLs is identical to the one used to split the data into fragments. So,
> >>>> if the culprit is in the splitting logic, one might see some weirdness as
> >>>> soon as we force the exclusive usage of the send protocol with an
> >>>> unconventional fragment size.
> >>>>
> >>>> In other words, using the following flags "--mca btl tcp,self --mca
> >>>> btl_tcp_flags 3 --mca btl_tcp_rndv_eager_limit 23 --mca btl_tcp_eager_limit 23
> >>>> --mca btl_tcp_max_send_size 23" should always transfer wrong data,
> >>>> even when only a single BTL is in play.
> >>>>
> >>>>   George.
> >>>>
> >>>> On May 7, 2014, at 13:11, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> >>>>
> >>>>> OK. So, I investigated a little more. I only see the issue when I am
> >>>>> running with multiple ports enabled, such that two openib BTLs are
> >>>>> instantiated. In addition, large-message RDMA has to be enabled. If
> >>>>> those conditions are not met, then I do not see the problem.
> >>>>> For example:
> >>>>>
> >>>>> FAILS:
> >>>>>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 3 MPI_Isend_ator_c
> >>>>>
> >>>>> PASS:
> >>>>>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1 -mca btl_openib_flags 3 MPI_Isend_ator_c
> >>>>>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 1 MPI_Isend_ator_c
> >>>>>
> >>>>> So we must have some type of issue when we break up the message
> >>>>> between the two openib BTLs. Maybe someone else can confirm my
> >>>>> observations?
> >>>>> I was testing against the latest trunk.
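
The effect Gilles describes above can be pictured with a small standalone sketch: when fragments of a packed stream are processed out of order, the unpack side must recompute where each fragment lands in a non-contiguous receive layout, which is the job opal_convertor_generic_simple_position() performs in the real code. The program below is a toy model, not Open MPI code; the strided buffer, the 23-byte fragment size, and the reverse processing order are illustrative assumptions.

/* Toy model (not Open MPI code): unpack fixed-size fragments of a packed
 * stream, in arbitrary order, into a strided (non-contiguous) buffer.
 * The packed_offset -> (element, byte) translation stands in for the
 * convertor position computation discussed above. */
#include <stdio.h>
#include <string.h>

#define COUNT      64      /* payload ints                      */
#define FRAG_BYTES 23      /* deliberately odd fragment size    */

/* Place len bytes of a fragment, starting at packed_offset in the packed
 * stream, into a buffer where only every second int is payload. */
static void unpack_fragment(int *strided, size_t packed_offset,
                            const unsigned char *frag, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        size_t byte = packed_offset + i;
        size_t elem = byte / sizeof(int);   /* which payload int    */
        size_t off  = byte % sizeof(int);   /* byte within that int */
        memcpy((unsigned char *)&strided[2 * elem] + off, frag + i, 1);
    }
}

int main(void)
{
    int src[COUNT], dst[2 * COUNT];
    unsigned char packed[sizeof(src)];

    for (int i = 0; i < COUNT; i++) src[i] = i + 1;
    memset(dst, 0, sizeof(dst));
    memcpy(packed, src, sizeof(src));       /* "sender" packs contiguously */

    size_t total = sizeof(packed);
    size_t nfrag = (total + FRAG_BYTES - 1) / FRAG_BYTES;

    /* Process fragments in reverse order to mimic out-of-order arrival. */
    for (size_t f = nfrag; f-- > 0; ) {
        size_t off = f * FRAG_BYTES;
        size_t len = (total - off < FRAG_BYTES) ? total - off : FRAG_BYTES;
        unpack_fragment(dst, off, packed + off, len);
    }

    for (int i = 0; i < COUNT; i++) {
        if (dst[2 * i] != src[i]) {
            printf("mismatch at element %d\n", i);
            return 1;
        }
    }
    printf("all %d elements correct despite out-of-order fragments\n", COUNT);
    return 0;
}

If the position computation mapped a packed offset to the wrong place in the strided layout, this toy program would report a mismatch only for the out-of-order case, which mirrors the symptom seen with the scif btl.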
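George's point that multi-rail striping and single-rail fragmentation share the same splitting mechanism can also be sketched. The helper name, the two-rail split, and the 23-byte cap below are illustrative assumptions, not the actual OB1/BML logic; the sketch only shows why a bug in the offset bookkeeping would surface both with two openib BTLs and with a single BTL forced to use tiny send fragments.

/* Toy sketch (not the actual OB1/BML code): the same carve() helper is
 * used whether a 100-byte send is striped across two rails or cut into
 * 23-byte fragments on a single rail. */
#include <stdio.h>

/* Size of the next chunk to send, given what remains and a per-chunk cap. */
static size_t carve(size_t remaining, size_t cap)
{
    return remaining < cap ? remaining : cap;
}

static void send_in_chunks(const char *label, size_t offset,
                           size_t length, size_t cap)
{
    printf("%s:", label);
    while (length > 0) {
        size_t c = carve(length, cap);
        printf(" [%zu,%zu)", offset, offset + c);
        offset += c;
        length -= c;
    }
    printf("\n");
}

int main(void)
{
    const size_t msg = 100;

    /* One rail, send protocol forced, 23-byte fragments
     * (cf. btl_tcp_max_send_size 23 above). */
    send_in_chunks("single rail, 23-byte fragments", 0, msg, 23);

    /* Two equal rails: split the message, then fragment each half
     * with exactly the same helper. */
    send_in_chunks("rail 0", 0, msg / 2, 23);
    send_in_chunks("rail 1", msg / 2, msg - msg / 2, 23);
    return 0;
}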