Re: [OMPI devel] regression with derived datatypes

2014-05-30 Thread Rolf vandeVaart
This fixed all of my issues.  Thanks.  I will add that comment to the ticket as well.

>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/05/14766.php



Re: [OMPI devel] regression with derived datatypes

2014-05-29 Thread George Bosilca
r31904 should fix this issue. Please test it thoroughly and report any issues.

  George.




Re: [OMPI devel] regression with derived datatypes

2014-05-09 Thread Gilles Gouaillardet
i opened #4610 https://svn.open-mpi.org/trac/ompi/ticket/4610
and attached a patch for the v1.8 branch

i ran several tests from the intel_tests test suite and did not observe
any regression.

please note there are still issues when running with --mca btl
scif,vader,self

this might be another issue, i will investigate more next week

Gilles



Re: [OMPI devel] regression with derived datatypes

2014-05-09 Thread Gilles Gouaillardet
I ran some more investigations with --mca btl scif,self

i found that the previous patch i posted was complete crap and i
apologize for it.

on a brighter side, and imho, the issue only occurs if fragments are
received (and then processed) out of order.
/* i did not observe this with the tcp btl, but i always see that with
the scif btl, i guess this can be observed too
with openib+RDMA */

in this case only, opal_convertor_generic_simple_position(...) is
invoked and does not set the pConvertor->pStack
as expected by r31496

i will run some more tests from now on

Gilles
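
The out-of-order case described above can be illustrated with a toy model. This is a minimal sketch, not Open MPI's convertor code: toy_layout_t, toy_cursor_t, toy_set_position and toy_unpack are hypothetical names invented here. It assumes a destination made of equal-size blocks separated by holes, and shows why a receiver that handles fragments out of order has to recompute its position from each fragment's offset in the packed stream rather than continue from where the previous fragment stopped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy layout: `count` blocks of `blocklen` bytes, each block
 * starting `stride` bytes after the previous one (stride >= blocklen). */
typedef struct {
    size_t count;
    size_t blocklen;
    size_t stride;
} toy_layout_t;

/* Hypothetical cursor, standing in for the state a real convertor keeps. */
typedef struct {
    size_t block;   /* index of the block currently being filled */
    size_t within;  /* bytes already filled in that block */
} toy_cursor_t;

/* Recompute the cursor from an absolute offset in the packed stream. */
static void toy_set_position(const toy_layout_t *l, size_t packed_off,
                             toy_cursor_t *c)
{
    c->block = packed_off / l->blocklen;
    c->within = packed_off % l->blocklen;
}

/* Scatter `len` packed bytes into dst starting at the cursor, skipping the
 * holes between blocks. Advances the cursor; returns bytes consumed. */
static size_t toy_unpack(const toy_layout_t *l, toy_cursor_t *c,
                         unsigned char *dst, const unsigned char *src,
                         size_t len)
{
    size_t done = 0;
    while (done < len && c->block < l->count) {
        size_t room = l->blocklen - c->within;
        size_t n = (len - done < room) ? len - done : room;
        memcpy(dst + c->block * l->stride + c->within, src + done, n);
        done += n;
        c->within += n;
        if (c->within == l->blocklen) {
            c->block++;
            c->within = 0;
        }
    }
    return done;
}

/* Deliver fragments of `frag` bytes in reverse order. With `reposition`
 * set, the cursor is recomputed for every fragment; otherwise the stale
 * cursor is reused. Returns 1 if the result matches an in-order unpack. */
static int toy_reverse_delivery_ok(size_t frag, int reposition)
{
    const toy_layout_t l = { 8, 8, 16 };
    unsigned char packed[64], expect[128], got[128];
    toy_cursor_t c = { 0, 0 };
    size_t total = sizeof(packed), nfrag, k;

    assert(frag > 0);
    for (k = 0; k < total; k++)
        packed[k] = (unsigned char)(k + 1);
    memset(expect, 0, sizeof(expect));
    memset(got, 0, sizeof(got));

    /* Reference result: one in-order unpack of the whole stream. */
    toy_unpack(&l, &c, expect, packed, total);

    c.block = 0;
    c.within = 0;
    nfrag = (total + frag - 1) / frag;
    for (k = nfrag; k-- > 0; ) {
        size_t off = k * frag;
        size_t len = (total - off < frag) ? total - off : frag;
        if (reposition)
            toy_set_position(&l, off, &c);
        toy_unpack(&l, &c, got, packed + off, len);
    }
    return memcmp(expect, got, sizeof(got)) == 0;
}
```

In this model, toy_reverse_delivery_ok(23, 1) succeeds while toy_reverse_delivery_ok(23, 0) fails, because without repositioning the last fragment's bytes land at the start of the destination.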



Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Gilles Gouaillardet
Nathan and George,

here are the (compressed) traces

Gilles




r31678.log.bz2
Description: Binary data


r31678withoutr31496.log.bz2
Description: Binary data


Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Elena Elkina
Hi,

My reproducer failed even with one port enabled (-mca btl_openib_if_include
mlx4_0:1 ).
I tried with trunk as well - the same issue.

Best,
Elena




Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Gilles Gouaillardet
Nathan and George,

here are the output files of the original test_scif.c
the command line was

mpirun -np 2 -host localhost --mca btl scif,vader,self --mca
mpi_ddt_unpack_debug 1 --mca mpi_ddt_pack_debug 1 --mca
mpi_ddt_position_debug 1 a.out

this is a silent failure and there is no core file
the test itself detects it did not receive the expected value
/* grep "expected" in the output */

Gilles

On 2014/05/08 16:43, Hjelm, Nathan T wrote:
> If you can get me the backtrace from one of the crash core files I would like 
> to see what is going on there.
>



Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Hjelm, Nathan T
If you can get me the backtrace from one of the crash core files I would like 
to see what is going on there.

-Nathan



Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Gilles Gouaillardet
George,

you do not need any hardware, just download MPSS from Intel and install it.
make sure the mic kernel module is loaded *and* you can read/write to the
newly created /dev/mic/* devices.

/* i am now running this on a virtual machine with no MIC whatsoever */

i was able to improve things a bit for the new attached test case
/* send MPI_PACKED / recv newtype */
with the attached unpack.patch.

it has to be applied on r31678 (aka the latest checkout of the v1.8 branch)

with this patch (no regression testing so far, it might solve one problem
but break something else !)

mpirun -np 2 -host localhost --mca btl,scif,vader ./test_scif2
works fine :-)

but

mpirun -np 2 -host localhost --mca btl scif,vader ./test_scif2
still crashes (and it did not crash before r31496)

i will provide the output you requested shortly

Cheers,

Gilles
/*
 * This test is an oversimplified version of collective/bcast_struct
 * that comes with the ibm test suite.
 * it must be run on two tasks on a single host where the MIC software stack
 * is present (e.g. libscif.so is present, the mic driver is loaded and
 * /dev/mic/* are accessible) and the scif btl is available.
 *
 * mpirun -np 2 -host localhost --mca btl scif,vader,self ./test_scif
 * will produce incorrect results with trunk and v1.8
 *
 * mpirun -np 2 --mca btl ^scif -host localhost ./test_scif
 * will work with trunk and v1.8
 *
 * mpirun -np 2 --mca btl scif,self -host localhost ./test_scif
 * will produce correct results with v1.8 r31309 (but eventually crash in
 * MPI_Finalize)
 * and produce incorrect results with v1.8 r31671 and trunk r31667
 *
 * Copyright (c) 2011  Oracle and/or its affiliates.  All rights reserved.
 * Copyright (c) 2014  Research Organization for Information Science
 * and Technology (RIST). All rights reserved.
 */
/*

 MESSAGE PASSING INTERFACE TEST CASE SUITE

 Copyright IBM Corp. 1995

 IBM Corp. hereby grants a non-exclusive license to use, copy, modify, and
 distribute this software for any purpose and without fee provided that the
 above copyright notice and the following paragraphs appear in all copies.

 IBM Corp. makes no representation that the test cases comprising this
 suite are correct or are an accurate representation of any standard.

 In no event shall IBM be liable to any party for direct, indirect, special
 incidental, or consequential damage arising out of the use of this software
 even if IBM Corp. has been advised of the possibility of such damage.

 IBM CORP. SPECIFICALLY DISCLAIMS ANY WARRANTIES INCLUDING, BUT NOT LIMITED
 TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS AND IBM
 CORP. HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES,
 ENHANCEMENTS, OR MODIFICATIONS.



 These test cases reflect an interpretation of the MPI Standard.  They are,
 in most cases, unit tests of specific MPI behaviors.  If a user of any
 test case from this set believes that the MPI Standard requires behavior
 different than that implied by the test case we would appreciate feedback.

 Comments may be sent to:
Richard Treumann
treum...@kgn.ibm.com


*/
#include <stdio.h>
#include <stdlib.h>
#include <poll.h>
#include "mpi.h"

#define ompitest_error(file,line,...) {fprintf(stderr, "FUCK at %s:%d root=%d size=%d (i,j)=(%d,%d)\n", file, line,root, i0, i, j); MPI_Abort(MPI_COMM_WORLD, 1);}

const int SIZE = 1000;

int main(int argc, char **argv)
{
   int myself;

   double a[2], t_stop;
   int ii, size;
   int len[2];
   MPI_Aint disp[2];
   MPI_Datatype type[2], newtype, t1, t2;
   struct foo_t {
      int i[3];
      double d[3];
   } foo, *bar;
   struct pfoo_t {
      int i[2];
      double d[2];
   } pfoo, *pbar;
   int i0, i, j, root, nseconds = 600, done_flag;
   int _dbg=0;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &myself);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   // _dbg = (0 == myself);
   /* debugging aid: spin here until a debugger clears _dbg */
   while (_dbg) poll(NULL,0,1);

   if ( argc > 1 ) nseconds = atoi(argv[1]);
   t_stop = MPI_Wtime() + nseconds;

   /*-*/
   /* Build a datatype that is guaranteed to have holes; send/recv
      large numbers of them */

   MPI_Type_vector(2, 1, 2, MPI_INT, &t1);
   MPI_Type_commit(&t1);
   MPI_Type_vector(2, 1, 2, MPI_DOUBLE, &t2);
   MPI_Type_commit(&t2);

   len[0] = len[1] = 1;
   MPI_Address(&foo.i[0], &disp[0]);
   MPI_Address(&foo.d[0], &disp[1]);
   printf ("%d: %lx %lx\n", myself, (unsigned long) disp[0], (unsigned long) disp[1]);
   /* make the displacements relative to the start of foo */
   disp[0] -= (MPI_Aint) &foo;
   disp[1] -= (MPI_Aint) &foo;
   printf ("%d: %ld %ld\n", myself, (long) disp[0], (long) disp[1]);
   type[0] = t1;
   type[1] = t2;
   MPI_Type_struct(2, len, disp, type, &newtype);
   MPI_Type_commit(&newtype);

#if 0
   if (0 == myself) {
   foo.i[0] = 123;
   foo.i[1] = 456;
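
The layout that the test's datatype describes can be sketched without MPI. The block below is an illustration only, under the assumption that the two vector types (count 2, block length 1, stride 2) select i[0], i[2], d[0] and d[2] of struct foo_t, leaving i[1], d[1] and any padding as holes. The toy_* helpers are hypothetical names and are not part of the test program or of any MPI API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Same member layout as struct foo_t in the test program. */
struct foo_t {
    int i[3];
    double d[3];
};

/* Bytes actually transferred per element: two ints and two doubles. */
static size_t toy_packed_size(void)
{
    return 2 * sizeof(int) + 2 * sizeof(double);
}

/* Size of the k-th selected item (k = 0..3): i[0], i[2], d[0], d[2]. */
static size_t toy_item_size(int k)
{
    return (k < 2) ? sizeof(int) : sizeof(double);
}

/* Byte offset of the k-th selected item inside struct foo_t. */
static size_t toy_item_offset(int k)
{
    assert(k >= 0 && k < 4);
    if (k < 2)
        return offsetof(struct foo_t, i) + (size_t)(2 * k) * sizeof(int);
    return offsetof(struct foo_t, d) + (size_t)(2 * (k - 2)) * sizeof(double);
}

/* Gather the selected items into a contiguous buffer; returns bytes written. */
static size_t toy_pack(const struct foo_t *in, unsigned char *out)
{
    size_t pos = 0;
    int k;
    for (k = 0; k < 4; k++) {
        memcpy(out + pos, (const unsigned char *)in + toy_item_offset(k),
               toy_item_size(k));
        pos += toy_item_size(k);
    }
    return pos;
}

/* Scatter a contiguous buffer back into the selected items only. */
static size_t toy_unpack_foo(struct foo_t *out, const unsigned char *in)
{
    size_t pos = 0;
    int k;
    for (k = 0; k < 4; k++) {
        memcpy((unsigned char *)out + toy_item_offset(k), in + pos,
               toy_item_size(k));
        pos += toy_item_size(k);
    }
    return pos;
}

/* Round trip: the holes (i[1], d[1]) of the destination stay untouched. */
static int toy_roundtrip_keeps_holes(void)
{
    struct foo_t src = { {1, 2, 3}, {1.5, 2.5, 3.5} };
    struct foo_t dst;
    unsigned char wire[2 * sizeof(int) + 2 * sizeof(double)];

    memset(&dst, 0, sizeof(dst));
    if (toy_pack(&src, wire) != toy_packed_size())
        return 0;
    toy_unpack_foo(&dst, wire);
    return dst.i[0] == 1 && dst.i[1] == 0 && dst.i[2] == 3 &&
           dst.d[0] == 1.5 && dst.d[1] == 0.0 && dst.d[2] == 3.5;
}
```

The packed form has the same shape as struct pfoo_t in the test (two ints followed by two doubles), which is why the test can compare a received pfoo_t against the selected members of a foo_t.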
  

Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread George Bosilca
Nathan, or anybody with access to the target hardware,

If you can provide minimal output of the application with and
without the above-mentioned patch, with mpi_ddt_unpack_debug,
mpi_ddt_pack_debug, and mpi_ddt_position_debug all set to 1, I will try
to help.

  George.




Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Hjelm, Nathan T
Since I have a system that has the scif libraries installed I will try to 
reproduce and see if I can come up with a fix. It will probably be sometime 
next week at the earliest.

-Nathan



Re: [OMPI devel] regression with derived datatypes

2014-05-08 Thread Gilles Gouaillardet

On 2014/05/08 2:15, Ralph Castain wrote:
> I wonder if that might also explain the issue reported by Gilles regarding 
> the scif BTL? In his example, the problem only occurred if the message was 
> split across scif and vader. If so, then it might be that splitting messages 
> in general is broken.
>
i am afraid there is a misunderstanding :
the problem always occurs with scif,vader,self (regardless of the ompi v1.8
version)
the problem occurs with scif,self only if r31496 is applied to ompi v1.8


In my previous email
http://www.open-mpi.org/community/lists/devel/2014/05/14699.php
i reported the following interesting fact :

with ompi v1.8 (latest r31678), the following command produces incorrect
results :
mpirun -host localhost -np 2 --mca btl scif,self ./test_scif

but with ompi v1.8 r31309, the very same command produces correct results

Elena pointed out that r31496 is a suspect. so i took the latest v1.8
(r31678) and reverted r31496 and ...


mpirun -host localhost -np 2 --mca btl scif,self ./test_scif

works again !

note that the "default"
mpirun -host localhost -np 2 --mca btl scif,vader,self ./test_scif
still produces incorrect results

in order to reproduce the issue, a MIC is *not* needed,
you only need to install the software stack, load the mic kernel module
and make sure you can read/write /dev/mic/*

bottom line, there are two issues here :
1) r31496 broke something : mpirun -np 2 -host localhost --mca btl
scif,self ./test_scif
2) something else never worked : mpirun -np 2 -host localhost --mca btl
scif,vader,self ./test_scif

Gilles



Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Rolf vandeVaart
I tried this.  However, 23 bytes is too small, so I added the 23 to the 56 bytes 
required for the PML header (79 total).  With that setting I do not get the error.

mpirun -host host0,host1 -np 2 --mca btl self,tcp --mca btl_tcp_flags 3 --mca 
btl_tcp_rndv_eager_limit 23 --mca btl_tcp_eager_limit 23 --mca 
btl_tcp_max_send_size 23 MPI_Isend_ator_c
*** An error occurred in MPI_Init
The "eager limit" MCA parameter in the tcp BTL was set to a value which
is too low for Open MPI to function properly.  Please re-run your job
with a higher eager limit value for this BTL; the exact MCA parameter
name and its corresponding minimum value is shown below.

  Local host:              host0
  BTL name:                tcp
  BTL eager limit value:   23 (set via btl_tcp_eager_limit)
  BTL eager limit minimum: 56
  MCA parameter name:      btl_tcp_eager_limit
--

mpirun -host host0,host1 -np 2 --mca btl self,tcp --mca btl_tcp_flags 3 --mca 
btl_tcp_rndv_eager_limit 79 --mca btl_tcp_eager_limit 79 --mca 
btl_tcp_max_send_size 79 MPI_Isend_ator_c
MPITEST info  (0): Starting MPI_Isend_ator: All Isend TO Root test
MPITEST info  (0): Node spec MPITEST_comm_sizes[6]=2 too large, using 1
MPITEST info  (0): Node spec MPITEST_comm_sizes[22]=2 too large, using 1
MPITEST info  (0): Node spec MPITEST_comm_sizes[32]=2 too large, using 1
MPITEST_results: MPI_Isend_ator: All Isend TO Root all tests PASSED (3744)


From: devel [devel-boun...@open-mpi.org] On Behalf Of George Bosilca 
[bosi...@icl.utk.edu]
Sent: Wednesday, May 07, 2014 1:23 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] regression with derived datatypes

Strange. The outcome and the timing of this issue seem to highlight a link 
with the other datatype-related issue you reported earlier and, as suggested by 
Ralph, with Gilles' scif+vader issue.

Generally speaking, the mechanism used to split the data across multiple BTLs 
is identical to the one used to split the data into fragments. So, if the 
culprit is in the splitting logic, one might see some weirdness as soon as we 
force the exclusive usage of the send protocol with an unconventional 
fragment size.

In other words, using the flags "--mca btl tcp,self --mca btl_tcp_flags 3 
--mca btl_tcp_rndv_eager_limit 23 --mca btl_tcp_eager_limit 23 --mca 
btl_tcp_max_send_size 23" should always transfer wrong data, even when only a 
single BTL is in play.

  George.
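
To see why a 23-byte fragment size is a useful stress case, here is a rough sketch of the boundary arithmetic. It is not taken from Open MPI; toy_fragment_count and toy_split_elements are hypothetical helpers, and the model assumes a packed stream of equal-size elements cut into fixed-size payloads. Because 23 is not a multiple of 4 or 8, most fragment boundaries fall in the middle of an int or a double, so the splitting logic has to resume mid-element almost every time.

```c
#include <assert.h>
#include <stddef.h>

/* Number of fragments needed to carry `total` packed bytes when each
 * fragment holds at most `payload` bytes. */
static size_t toy_fragment_count(size_t total, size_t payload)
{
    assert(payload > 0);
    return (total + payload - 1) / payload;
}

/* Count the interior fragment boundaries that fall strictly inside an
 * element, for a packed stream of `elem_size`-byte elements. */
static size_t toy_split_elements(size_t total, size_t payload,
                                 size_t elem_size)
{
    size_t off, split = 0;

    assert(payload > 0 && elem_size > 0);
    for (off = payload; off < total; off += payload) {
        if (off % elem_size != 0)
            split++;
    }
    return split;
}
```

For 1000 bytes of 8-byte doubles, this model gives 44 fragments with a 23-byte payload, and 38 of the 43 interior boundaries cut through a double. With a 64-byte payload no element is split.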

On May 7, 2014, at 13:11 , Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

OK.  So, I investigated a little more.  I only see the issue when I am running 
with multiple ports enabled such that I have two openib BTLs instantiated.  In 
addition, large message RDMA has to be enabled.  If those conditions are not 
met, then I do not see the problem.  For example:
FAILS:
  mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 3 MPI_Isend_ator_c
PASS:
  mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1 -mca btl_openib_flags 3 MPI_Isend_ator_c
  mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 1 MPI_Isend_ator_c

So we must have some type of issue when we break up the message between the two 
openib BTLs.  Maybe someone else can confirm my observations?
I was testing against the latest trunk.

Rolf

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Joshua Ladd
Sent: Wednesday, May 07, 2014 10:48 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] regression with derived datatypes

Rolf,
This was run on a Sandy Bridge system with ConnectX-3 cards.
Josh

On Wed, May 7, 2014 at 10:46 AM, Joshua Ladd 
<jladd.m...@gmail.com<mailto:jladd.m...@gmail.com>> wrote:
Elena, can you run your reproducer on the trunk, please, and see if the problem 
persists?
Josh

On Wed, May 7, 2014 at 10:26 AM, Jeff Squyres (jsquyres) 
<jsquy...@cisco.com<mailto:jsquy...@cisco.com>> wrote:
On May 7, 2014, at 10:03 AM, Elena Elkina 
<elena.elk...@itseez.com<mailto:elena.elk...@itseez.com>> wrote:

> Yes, this commit is also in the trunk.
Yes, I understand that -- my question is: is this same *behavior* happening on 
the trunk.  I.e., is there some other effect on the trunk that is causing the 
bad behavior to not occur?

> Best,
> Elena
>
>
> On Wed, May 7, 2014 at 5:45 PM, Jeff Squyres (jsquyres) 
> <jsquy...@cisco.com<mailto:jsquy...@cisco.com>> wrote:
> Is this also happening on the trunk?
>
>
> Sent from my phone. No type good.
>
> On May 7, 2014, at 9:44 AM, "Elena Elkina" 
> <elena.elk...@itseez.com<mailto:elena.elk...@itseez.com>> wrote:
>
>> Sorry,
>>
>> Fixes #4501: Datatype unpack code produces incorrect results in some case
>>
>> ---svn-p

Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread George Bosilca
Strange. The outcome and the timing of this issue seem to highlight a link 
with the other datatype-related issue you reported earlier and, as suggested by 
Ralph, with Gilles's scif+vader issue.

Generally speaking, the mechanism used to split the data in the case of 
multiple BTLs is identical to the one used to split the data into fragments. So, 
if the culprit is in the splitting logic, one might see some weirdness as soon 
as we force the exclusive usage of the send protocol with an unconventional 
fragment size.

In other words, using the following flags "--mca btl tcp,self --mca btl_tcp_flags 
3 --mca btl_tcp_rndv_eager_limit 23 --mca btl_tcp_eager_limit 23 --mca 
btl_tcp_max_send_size 23" should always transfer wrong data, even when only a 
single BTL is in play.

  George.
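
A minimal sketch of the splitting idea, assuming a message is cut into fixed-size fragments that each carry their byte offset. The function names are hypothetical and this is not the Open MPI code path. Reassembly by offset yields the same bytes whatever the arrival order, which is the property a bug in the splitting logic would break.

```python
def split_into_fragments(data, fragment_size):
    """Return (offset, chunk) pairs covering `data` in fragment_size pieces."""
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    return [(offset, data[offset:offset + fragment_size])
            for offset in range(0, len(data), fragment_size)]


def reassemble(fragments, total_length):
    """Place each chunk at its own offset; arrival order does not matter."""
    out = bytearray(total_length)
    for offset, chunk in fragments:
        out[offset:offset + len(chunk)] = chunk
    return bytes(out)


message = bytes(range(100))
fragments = split_into_fragments(message, 23)  # the unconventional size from the text
# Deliver the fragments in reverse to mimic out-of-order arrival.
rebuilt = reassemble(list(reversed(fragments)), len(message))
assert rebuilt == message
```

The same helper applies whether the fragments travel over one BTL or are spread across several, which is why a single-BTL run with a small fragment size is a useful reproducer.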

On May 7, 2014, at 13:11 , Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> OK.  So, I investigated a little more.  I only see the issue when I am 
> running with multiple ports enabled such that I have two openib BTLs 
> instantiated.  In addition, large message RDMA has to be enabled.  If those 
> conditions are not met, then I do not see the problem.  For example:
> FAILS:
>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 3 MPI_Isend_ator_c
> PASS:
>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1 -mca btl_openib_flags 3 MPI_Isend_ator_c
>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 1 MPI_Isend_ator_c
>  
> So we must have some type of issue when we break up the message between the 
> two openib BTLs.  Maybe someone else can confirm my observations?
> I was testing against the latest trunk.
>  
> Rolf
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Joshua Ladd
> Sent: Wednesday, May 07, 2014 10:48 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] regression with derived datatypes
>  
> Rolf,
> 
> This was run on a Sandy Bridge system with ConnectX-3 cards.
> 
> Josh
>  
> 
> On Wed, May 7, 2014 at 10:46 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> Elena, can you run your reproducer on the trunk, please, and see if the 
> problem persists?
> 
> Josh
>  
> 
> On Wed, May 7, 2014 at 10:26 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> On May 7, 2014, at 10:03 AM, Elena Elkina <elena.elk...@itseez.com> wrote:
> 
> > Yes, this commit is also in the trunk.
> 
> Yes, I understand that -- my question is: is this same *behavior* happening 
> on the trunk.  I.e., is there some other effect on the trunk that is causing 
> the bad behavior to not occur?
> 
> > Best,
> > Elena
> >
> >
> > On Wed, May 7, 2014 at 5:45 PM, Jeff Squyres (jsquyres) 
> > <jsquy...@cisco.com> wrote:
> > Is this also happening on the trunk?
> >
> >
> > Sent from my phone. No type good.
> >
> > On May 7, 2014, at 9:44 AM, "Elena Elkina" <elena.elk...@itseez.com> wrote:
> >
> >> Sorry,
> >>
> >> Fixes #4501: Datatype unpack code produces incorrect results in some case
> >>
> >> ---svn-pre-commit-ignore-below---
> >>
> >> r31370 [[BR]]
> >> Reshape all the packing/unpacking functions to use the same skeleton. 
> >> Rewrite the
> >> generic_unpacking to take advantage of the same capabilities.
> >>
> >> r31380 [[BR]]
> >> Remove a non-necessary label.
> >>
> >> r31387 [[BR]]
> >> Correctly save the displacement for the case where the convertor is not
> >> completed. As we need to have the right displacement at the beginning
> >> of the next call, we should save the position relative to the beginning
> >> of the buffer and not to the last loop.
> >>
> >> Best regards,
> >> Elena
> >>
> >>
> >> On Wed, May 7, 2014 at 5:43 PM, Jeff Squyres (jsquyres) 
> >> <jsquy...@cisco.com> wrote:
> >> Can you cite the branch and SVN r number?
> >>
> >> Sent from my phone. No type good.
> >>
> >> > On May 7, 2014, at 9:24 AM, "Elena Elkina" <elena.elk...@itseez.com> 
> >> > wrote:
> >> >
> >> > b531973419a056696e6f88d813769aa4f1f1aee6
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post: 
> >> http://www.open-mpi.org/community/lists/devel/2014/05/14701.php
> >>
> >> 

Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Ralph Castain
I wonder if that might also explain the issue reported by Gilles regarding the 
scif BTL? In his example, the problem only occurred if the message was split 
across scif and vader. If so, then it might be that splitting messages in 
general is broken.


On May 7, 2014, at 10:11 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> OK.  So, I investigated a little more.  I only see the issue when I am 
> running with multiple ports enabled such that I have two openib BTLs 
> instantiated.  In addition, large message RDMA has to be enabled.  If those 
> conditions are not met, then I do not see the problem.  For example:
> FAILS:
>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 3 MPI_Isend_ator_c
> PASS:
>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1 -mca btl_openib_flags 3 MPI_Isend_ator_c
>   mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 1 MPI_Isend_ator_c
>  
> So we must have some type of issue when we break up the message between the 
> two openib BTLs.  Maybe someone else can confirm my observations?
> I was testing against the latest trunk.
>  
> Rolf
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Joshua Ladd
> Sent: Wednesday, May 07, 2014 10:48 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] regression with derived datatypes
>  
> Rolf,
> 
> This was run on a Sandy Bridge system with ConnectX-3 cards.
> 
> Josh
>  
> 
> On Wed, May 7, 2014 at 10:46 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
> Elena, can you run your reproducer on the trunk, please, and see if the 
> problem persists?
> 
> Josh
>  
> 
> On Wed, May 7, 2014 at 10:26 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> 
> wrote:
> On May 7, 2014, at 10:03 AM, Elena Elkina <elena.elk...@itseez.com> wrote:
> 
> > Yes, this commit is also in the trunk.
> 
> Yes, I understand that -- my question is: is this same *behavior* happening 
> on the trunk.  I.e., is there some other effect on the trunk that is causing 
> the bad behavior to not occur?
> 
> > Best,
> > Elena
> >
> >
> > On Wed, May 7, 2014 at 5:45 PM, Jeff Squyres (jsquyres) 
> > <jsquy...@cisco.com> wrote:
> > Is this also happening on the trunk?
> >
> >
> > Sent from my phone. No type good.
> >
> > On May 7, 2014, at 9:44 AM, "Elena Elkina" <elena.elk...@itseez.com> wrote:
> >
> >> Sorry,
> >>
> >> Fixes #4501: Datatype unpack code produces incorrect results in some case
> >>
> >> ---svn-pre-commit-ignore-below---
> >>
> >> r31370 [[BR]]
> >> Reshape all the packing/unpacking functions to use the same skeleton. 
> >> Rewrite the
> >> generic_unpacking to take advantage of the same capabilities.
> >>
> >> r31380 [[BR]]
> >> Remove a non-necessary label.
> >>
> >> r31387 [[BR]]
> >> Correctly save the displacement for the case where the convertor is not
> >> completed. As we need to have the right displacement at the beginning
> >> of the next call, we should save the position relative to the beginning
> >> of the buffer and not to the last loop.
> >>
> >> Best regards,
> >> Elena
> >>
> >>
> >> On Wed, May 7, 2014 at 5:43 PM, Jeff Squyres (jsquyres) 
> >> <jsquy...@cisco.com> wrote:
> >> Can you cite the branch and SVN r number?
> >>
> >> Sent from my phone. No type good.
> >>
> >> > On May 7, 2014, at 9:24 AM, "Elena Elkina" <elena.elk...@itseez.com> 
> >> > wrote:
> >> >
> >> > b531973419a056696e6f88d813769aa4f1f1aee6
> >>
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post: 
> >> http://www.open-mpi.org/community/lists/devel/2014/05/14702.php
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: 
> > 

Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Rolf vandeVaart
OK.  So, I investigated a little more.  I only see the issue when I am running 
with multiple ports enabled such that I have two openib BTLs instantiated.  In 
addition, large message RDMA has to be enabled.  If those conditions are not 
met, then I do not see the problem.  For example:
FAILS:
  mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 3 MPI_Isend_ator_c
PASS:
  mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1 -mca btl_openib_flags 3 MPI_Isend_ator_c
  mpirun -np 2 -host host1,host2 -mca btl_openib_if_include mlx5_0:1,mlx5_0:2 -mca btl_openib_flags 1 MPI_Isend_ator_c

So we must have some type of issue when we break up the message between the two 
openib BTLs.  Maybe someone else can confirm my observations?
I was testing against the latest trunk.

Rolf
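
The three runs above differ only in the port list and in the value of btl_openib_flags, which the following sketch makes explicit. It assumes the third command was meant to pass both ports to btl_openib_if_include; the helper name `openib_repro_command` is hypothetical.

```python
import shlex


def openib_repro_command(ports, flags, hosts=("host1", "host2"),
                         test="MPI_Isend_ator_c"):
    """Build the mpirun argument list for one port/flags combination."""
    return [
        "mpirun", "-np", "2", "-host", ",".join(hosts),
        "-mca", "btl_openib_if_include", ",".join(ports),
        "-mca", "btl_openib_flags", str(flags),
        test,
    ]


# Outcomes as reported in the message above.
cases = [
    ("FAILS", ["mlx5_0:1", "mlx5_0:2"], 3),  # two ports, flags 3
    ("PASS", ["mlx5_0:1"], 3),               # one port, flags 3
    ("PASS", ["mlx5_0:1", "mlx5_0:2"], 1),   # two ports, flags 1
]

if __name__ == "__main__":
    for outcome, ports, flags in cases:
        print(f"{outcome}: {shlex.join(openib_repro_command(ports, flags))}")
```

Keeping the two variables in one table makes it easier to see that only the combination of two ports and flags 3 was reported to fail.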

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Joshua Ladd
Sent: Wednesday, May 07, 2014 10:48 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] regression with derived datatypes

Rolf,
This was run on a Sandy Bridge system with ConnectX-3 cards.
Josh

On Wed, May 7, 2014 at 10:46 AM, Joshua Ladd 
<jladd.m...@gmail.com<mailto:jladd.m...@gmail.com>> wrote:
Elena, can you run your reproducer on the trunk, please, and see if the problem 
persists?
Josh

On Wed, May 7, 2014 at 10:26 AM, Jeff Squyres (jsquyres) 
<jsquy...@cisco.com<mailto:jsquy...@cisco.com>> wrote:
On May 7, 2014, at 10:03 AM, Elena Elkina 
<elena.elk...@itseez.com<mailto:elena.elk...@itseez.com>> wrote:

> Yes, this commit is also in the trunk.
Yes, I understand that -- my question is: is this same *behavior* happening on 
the trunk.  I.e., is there some other effect on the trunk that is causing the 
bad behavior to not occur?

> Best,
> Elena
>
>
> On Wed, May 7, 2014 at 5:45 PM, Jeff Squyres (jsquyres) 
> <jsquy...@cisco.com<mailto:jsquy...@cisco.com>> wrote:
> Is this also happening on the trunk?
>
>
> Sent from my phone. No type good.
>
> On May 7, 2014, at 9:44 AM, "Elena Elkina" 
> <elena.elk...@itseez.com<mailto:elena.elk...@itseez.com>> wrote:
>
>> Sorry,
>>
>> Fixes #4501: Datatype unpack code produces incorrect results in some case
>>
>> ---svn-pre-commit-ignore-below---
>>
>> r31370 [[BR]]
>> Reshape all the packing/unpacking functions to use the same skeleton. 
>> Rewrite the
>> generic_unpacking to take advantage of the same capabilities.
>>
>> r31380 [[BR]]
>> Remove a non-necessary label.
>>
>> r31387 [[BR]]
>> Correctly save the displacement for the case where the convertor is not
>> completed. As we need to have the right displacement at the beginning
>> of the next call, we should save the position relative to the beginning
>> of the buffer and not to the last loop.
>>
>> Best regards,
>> Elena
>>
>>
>> On Wed, May 7, 2014 at 5:43 PM, Jeff Squyres (jsquyres) 
>> <jsquy...@cisco.com<mailto:jsquy...@cisco.com>> wrote:
>> Can you cite the branch and SVN r number?
>>
>> Sent from my phone. No type good.
>>
>> > On May 7, 2014, at 9:24 AM, "Elena Elkina" 
>> > <elena.elk...@itseez.com<mailto:elena.elk...@itseez.com>> wrote:
>> >
>> > b531973419a056696e6f88d813769aa4f1f1aee6
>
> ___
> devel mailing list
> de...@open-mpi.org<mailto:de...@open-mpi.org>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14703.php
>
> ___
> devel mailing list
> de...@open-mpi.org<mailto:de...@open-mpi.org>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14704.php


--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

___

Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Joshua Ladd
Rolf,

This was run on a Sandy Bridge system with ConnectX-3 cards.

Josh


On Wed, May 7, 2014 at 10:46 AM, Joshua Ladd  wrote:

> Elena, can you run your reproducer on the trunk, please, and see if the
> problem persists?
>
> Josh
>
>
> On Wed, May 7, 2014 at 10:26 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> On May 7, 2014, at 10:03 AM, Elena Elkina 
>> wrote:
>>
>> > Yes, this commit is also in the trunk.
>>
>> Yes, I understand that -- my question is: is this same *behavior*
>> happening on the trunk.  I.e., is there some other effect on the trunk that
>> is causing the bad behavior to not occur?
>>
>> > Best,
>> > Elena
>> >
>> >
>> > On Wed, May 7, 2014 at 5:45 PM, Jeff Squyres (jsquyres) <
>> jsquy...@cisco.com> wrote:
>> > Is this also happening on the trunk?
>> >
>> >
>> > Sent from my phone. No type good.
>> >
>> > On May 7, 2014, at 9:44 AM, "Elena Elkina" 
>> wrote:
>> >
>> >> Sorry,
>> >>
>> >> Fixes #4501: Datatype unpack code produces incorrect results in some
>> case
>> >>
>> >> ---svn-pre-commit-ignore-below---
>> >>
>> >> r31370 [[BR]]
>> >> Reshape all the packing/unpacking functions to use the same skeleton.
>> Rewrite the
>> >> generic_unpacking to take advantage of the same capabilities.
>> >>
>> >> r31380 [[BR]]
>> >> Remove a non-necessary label.
>> >>
>> >> r31387 [[BR]]
>> >> Correctly save the displacement for the case where the convertor is not
>> >> completed. As we need to have the right displacement at the beginning
>> >> of the next call, we should save the position relative to the beginning
>> >> of the buffer and not to the last loop.
>> >>
>> >> Best regards,
>> >> Elena
>> >>
>> >>
>> >> On Wed, May 7, 2014 at 5:43 PM, Jeff Squyres (jsquyres) <
>> jsquy...@cisco.com> wrote:
>> >> Can you cite the branch and SVN r number?
>> >>
>> >> Sent from my phone. No type good.
>> >>
>> >> > On May 7, 2014, at 9:24 AM, "Elena Elkina" 
>> wrote:
>> >> >
>> >> > b531973419a056696e6f88d813769aa4f1f1aee6
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/05/14706.php
>>
>
>


Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Joshua Ladd
Elena, can you run your reproducer on the trunk, please, and see if the
problem persists?

Josh


On Wed, May 7, 2014 at 10:26 AM, Jeff Squyres (jsquyres)  wrote:

> On May 7, 2014, at 10:03 AM, Elena Elkina  wrote:
>
> > Yes, this commit is also in the trunk.
>
> Yes, I understand that -- my question is: is this same *behavior*
> happening on the trunk.  I.e., is there some other effect on the trunk that
> is causing the bad behavior to not occur?
>
> > Best,
> > Elena
> >
> >
> > On Wed, May 7, 2014 at 5:45 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > Is this also happening on the trunk?
> >
> >
> > Sent from my phone. No type good.
> >
> > On May 7, 2014, at 9:44 AM, "Elena Elkina" 
> wrote:
> >
> >> Sorry,
> >>
> >> Fixes #4501: Datatype unpack code produces incorrect results in some
> case
> >>
> >> ---svn-pre-commit-ignore-below---
> >>
> >> r31370 [[BR]]
> >> Reshape all the packing/unpacking functions to use the same skeleton.
> Rewrite the
> >> generic_unpacking to take advantage of the same capabilities.
> >>
> >> r31380 [[BR]]
> >> Remove a non-necessary label.
> >>
> >> r31387 [[BR]]
> >> Correctly save the displacement for the case where the convertor is not
> >> completed. As we need to have the right displacement at the beginning
> >> of the next call, we should save the position relative to the beginning
> >> of the buffer and not to the last loop.
> >>
> >> Best regards,
> >> Elena
> >>
> >>
> >> On Wed, May 7, 2014 at 5:43 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> >> Can you cite the branch and SVN r number?
> >>
> >> Sent from my phone. No type good.
> >>
> >> > On May 7, 2014, at 9:24 AM, "Elena Elkina" 
> wrote:
> >> >
> >> > b531973419a056696e6f88d813769aa4f1f1aee6
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>


Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Jeff Squyres (jsquyres)
On May 7, 2014, at 10:03 AM, Elena Elkina  wrote:

> Yes, this commit is also in the trunk.

Yes, I understand that -- my question is: is this same *behavior* happening on 
the trunk.  I.e., is there some other effect on the trunk that is causing the 
bad behavior to not occur?

> Best,
> Elena
> 
> 
> On Wed, May 7, 2014 at 5:45 PM, Jeff Squyres (jsquyres)  
> wrote:
> Is this also happening on the trunk?
> 
> 
> Sent from my phone. No type good. 
> 
> On May 7, 2014, at 9:44 AM, "Elena Elkina"  wrote:
> 
>> Sorry,
>> 
>> Fixes #4501: Datatype unpack code produces incorrect results in some case
>> 
>> ---svn-pre-commit-ignore-below---
>> 
>> r31370 [[BR]]
>> Reshape all the packing/unpacking functions to use the same skeleton. 
>> Rewrite the
>> generic_unpacking to take advantage of the same capabilities.
>> 
>> r31380 [[BR]]
>> Remove a non-necessary label.
>> 
>> r31387 [[BR]]
>> Correctly save the displacement for the case where the convertor is not
>> completed. As we need to have the right displacement at the beginning
>> of the next call, we should save the position relative to the beginning
>> of the buffer and not to the last loop.
>> 
>> Best regards,
>> Elena
>> 
>> 
>> On Wed, May 7, 2014 at 5:43 PM, Jeff Squyres (jsquyres)  
>> wrote:
>> Can you cite the branch and SVN r number?
>> 
>> Sent from my phone. No type good.
>> 
>> > On May 7, 2014, at 9:24 AM, "Elena Elkina"  wrote:
>> >
>> > b531973419a056696e6f88d813769aa4f1f1aee6


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Rolf vandeVaart
This seems similar to what I reported on a different thread.

http://www.open-mpi.org/community/lists/devel/2014/05/14688.php

I need to try and reproduce again.  Elena, what kind of cluster were you 
running on?

Rolf

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Elena Elkina
Sent: Wednesday, May 07, 2014 10:04 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] regression with derived datatypes

Yes, this commit is also in the trunk.

Best,
Elena

On Wed, May 7, 2014 at 5:45 PM, Jeff Squyres (jsquyres) 
<jsquy...@cisco.com<mailto:jsquy...@cisco.com>> wrote:
Is this also happening on the trunk?


Sent from my phone. No type good.

On May 7, 2014, at 9:44 AM, "Elena Elkina" 
<elena.elk...@itseez.com<mailto:elena.elk...@itseez.com>> wrote:
Sorry,

Fixes #4501: Datatype unpack code produces incorrect results in some case

---svn-pre-commit-ignore-below---

r31370 [[BR]]
Reshape all the packing/unpacking functions to use the same skeleton. Rewrite 
the
generic_unpacking to take advantage of the same capabilities.

r31380 [[BR]]
Remove a non-necessary label.

r31387 [[BR]]
Correctly save the displacement for the case where the convertor is not
completed. As we need to have the right displacement at the beginning
of the next call, we should save the position relative to the beginning
of the buffer and not to the last loop.

Best regards,
Elena

On Wed, May 7, 2014 at 5:43 PM, Jeff Squyres (jsquyres) 
<jsquy...@cisco.com<mailto:jsquy...@cisco.com>> wrote:
Can you cite the branch and SVN r number?

Sent from my phone. No type good.

> On May 7, 2014, at 9:24 AM, "Elena Elkina" 
> <elena.elk...@itseez.com<mailto:elena.elk...@itseez.com>> wrote:
>
> b531973419a056696e6f88d813769aa4f1f1aee6




Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Elena Elkina
Yes, this commit is also in the trunk.

Best,
Elena


On Wed, May 7, 2014 at 5:45 PM, Jeff Squyres (jsquyres)
wrote:

>  Is this also happening on the trunk?
>
>
> Sent from my phone. No type good.
>
> On May 7, 2014, at 9:44 AM, "Elena Elkina" 
> wrote:
>
>   Sorry,
>
>  Fixes #4501: Datatype unpack code produces incorrect results in some case
>
> ---svn-pre-commit-ignore-below---
>
> r31370 [[BR]]
> Reshape all the packing/unpacking functions to use the same skeleton.
> Rewrite the
> generic_unpacking to take advantage of the same capabilities.
>
> r31380 [[BR]]
> Remove a non-necessary label.
>
> r31387 [[BR]]
> Correctly save the displacement for the case where the convertor is not
> completed. As we need to have the right displacement at the beginning
> of the next call, we should save the position relative to the beginning
> of the buffer and not to the last loop.
>
>  Best regards,
> Elena
>
>
> On Wed, May 7, 2014 at 5:43 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> Can you cite the branch and SVN r number?
>>
>> Sent from my phone. No type good.
>>
>> > On May 7, 2014, at 9:24 AM, "Elena Elkina" 
>> wrote:
>> >
>> > b531973419a056696e6f88d813769aa4f1f1aee6
>


Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Jeff Squyres (jsquyres)
Is this also happening on the trunk?

Sent from my phone. No type good.

On May 7, 2014, at 9:44 AM, "Elena Elkina" wrote:

Sorry,

Fixes #4501: Datatype unpack code produces incorrect results in some case

---svn-pre-commit-ignore-below---

r31370 [[BR]]
Reshape all the packing/unpacking functions to use the same skeleton. Rewrite 
the
generic_unpacking to take advantage of the same capabilities.

r31380 [[BR]]
Remove a non-necessary label.

r31387 [[BR]]
Correctly save the displacement for the case where the convertor is not
completed. As we need to have the right displacement at the beginning
of the next call, we should save the position relative to the beginning
of the buffer and not to the last loop.

Best regards,
Elena


On Wed, May 7, 2014 at 5:43 PM, Jeff Squyres (jsquyres) wrote:
Can you cite the branch and SVN r number?

Sent from my phone. No type good.

> On May 7, 2014, at 9:24 AM, "Elena Elkina" wrote:
>
> b531973419a056696e6f88d813769aa4f1f1aee6


Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Elena Elkina
Sorry,

Fixes #4501: Datatype unpack code produces incorrect results in some case

---svn-pre-commit-ignore-below---

r31370 [[BR]]
Reshape all the packing/unpacking functions to use the same skeleton.
Rewrite the
generic_unpacking to take advantage of the same capabilities.

r31380 [[BR]]
Remove a non-necessary label.

r31387 [[BR]]
Correctly save the displacement for the case where the convertor is not
completed. As we need to have the right displacement at the beginning
of the next call, we should save the position relative to the beginning
of the buffer and not to the last loop.

Best regards,
Elena
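
To illustrate the r31387 note above, here is a toy sketch (not the Open MPI convertor) in which the saved position counts packed bytes from the beginning of the buffer, so an unpack can resume, or be repositioned for any fragment, without depending on where the last loop stopped. The class and method names are hypothetical, and the vector layout (count, blocklen, stride) is an assumption chosen for the example.

```python
class ToyVectorUnpacker:
    """Toy model of unpacking into a strided layout; not Open MPI code."""

    def __init__(self, count, blocklen, stride):
        if stride < blocklen:
            raise ValueError("stride must be at least blocklen")
        self.count = count
        self.blocklen = blocklen
        self.stride = stride
        self.user = bytearray(count * stride)
        # Packed bytes consumed so far, measured from the start of the
        # packed buffer (not from the start of the current block).
        self.position = 0

    def set_position(self, position):
        """Jump to an absolute packed offset, e.g. for an out-of-order fragment."""
        if not 0 <= position <= self.count * self.blocklen:
            raise ValueError("position outside the packed buffer")
        self.position = position

    def unpack(self, chunk):
        """Scatter `chunk` into the user buffer, resuming at self.position."""
        for byte in chunk:
            block, within = divmod(self.position, self.blocklen)
            self.user[block * self.stride + within] = byte
            self.position += 1


packed = bytes(range(1, 13))  # 4 blocks of 3 packed bytes
unpacker = ToyVectorUnpacker(count=4, blocklen=3, stride=5)
# Deliver 5-byte fragments out of order: offsets 5, 10, then 0.
for offset in (5, 10, 0):
    unpacker.set_position(offset)
    unpacker.unpack(packed[offset:offset + 5])
```

Because the block index is recomputed from the absolute position on every byte, a fragment that starts in the middle of a block lands in the right place regardless of which fragment was processed before it.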


On Wed, May 7, 2014 at 5:43 PM, Jeff Squyres (jsquyres)
wrote:

> Can you cite the branch and SVN r number?
>
> Sent from my phone. No type good.
>
> > On May 7, 2014, at 9:24 AM, "Elena Elkina" 
> wrote:
> >
> > b531973419a056696e6f88d813769aa4f1f1aee6
>


Re: [OMPI devel] regression with derived datatypes

2014-05-07 Thread Jeff Squyres (jsquyres)
Can you cite the branch and SVN r number?

Sent from my phone. No type good. 

> On May 7, 2014, at 9:24 AM, "Elena Elkina"  wrote:
> 
> b531973419a056696e6f88d813769aa4f1f1aee6