Nadia,

I guess the revisions you mentioned are from HG? If I’m not mistaken, the change 
you mentioned corresponds to r29285. I don’t think the two are related: r29285 
is about positioning a convertor, which is only used for multi-fragment 
messages, and that is not the case in your example.

I guess we should look at all the patches to opal/datatype and ompi/datatype 
over the last 13 months (since the starting point of 1.6.3).

  George.


On Nov 25, 2013, at 14:10 , Nadia Derbey <nadia.der...@bull.net> wrote:

> George,
> 
> Thx for the detailed answer!
> I did my tests on a v1.6.2 (changeset: 141b22222759).
> After you told me it worked for you with earlier releases, I looked at the 
> changesets applied since that time. I guess 28fd94d282a3 is the one that 
> fixes my issue?
> 
> Regards,
> Nadia
> 
> On 25/11/2013 13:36, George Bosilca wrote:
>> Nadia,
>> 
>> Which version of Open MPI are you using? I tried with the nightly r29751, 
>> the current 1.6 and the current 1.7 and I __always__ got the expected output.
>> 
>> There is a simple way to show what the datatype engine is doing. You can set 
>> the MCA parameters mpi_ddt_unpack_debug and mpi_ddt_pack_debug to get more 
>> info. If you only want to see how the datatype looks after the 
>> MPI_Type_commit step, you can call ompi_datatype_dump(ddt) directly. This 
>> will show the internals of the datatype, converted into predefined types.
>> 
>> As an example, I took the application you provided and built the following 
>> picture of what is sent and what is received (original buffer, send 
>> datatype, packed buffer, recv datatype, resulting buffer).
>> 
>> <Mail Attachment.png>
>> 
>> Now using the ompi_datatype_dump, I see the recv and the send datatypes as:
>> 
>> -cC---P-DB-[---][---]     OPAL_UINT1 count 8 disp 0x0 (0) extent 1 (size 8)
>> -cC---P-DB-[---][---]     OPAL_UINT1 count 8 disp 0x10 (16) extent 1 (size 8)
>> -cC---P-DB-[---][---]     OPAL_INT4 count 4 disp 0x30 (48) extent 4 (size 16)
>> -------G---[---][---]     OPAL_END_LOOP prev 3 elements first elem displacement 0 size of data 32
>> 
>> -cC---P-DB-[---][---]     OPAL_UINT1 count 24 disp 0x10 (16) extent 1 (size 24)
>> -cC---P-DB-[---][---]     OPAL_UINT1 count 8 disp 0x30 (48) extent 1 (size 8)
>> -------G---[---][---]     OPAL_END_LOOP prev 2 elements first elem displacement 16 size of data 32
>> 
>> This matches the datatypes drawn by hand perfectly.
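
As a rough illustration of the pipeline described above (original buffer, send datatype, packed buffer, recv datatype, resulting buffer), here is a minimal sketch in plain C. It is not Open MPI code: `seg_t`, `desc_size`, `pack` and `unpack` are hypothetical names, and each entry only mirrors the count, displacement and element size printed in the two dumps (the first dump taken as the recv datatype, the second as the send datatype, in the order they are listed).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flattened view of one dump line: `count` items of
 * `elem_size` bytes stored contiguously at byte offset `disp`. */
typedef struct {
    size_t count;
    size_t disp;
    size_t elem_size;
} seg_t;

/* Send datatype from the dump: 24 bytes at 16, 8 bytes at 48. */
const seg_t send_desc[2] = { {24, 16, 1}, {8, 48, 1} };
/* Recv datatype from the dump: 8 bytes at 0, 8 bytes at 16, 4 ints at 48. */
const seg_t recv_desc[3] = { {8, 0, 1}, {8, 16, 1}, {4, 48, 4} };

/* Total number of payload bytes described by `n` segments. */
size_t desc_size(const seg_t *d, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += d[i].count * d[i].elem_size;
    return total;
}

/* Gather the described regions of `base` into the contiguous buffer
 * `packed` (capacity `cap`); returns the number of bytes written. */
size_t pack(const seg_t *d, size_t n, const unsigned char *base,
            unsigned char *packed, size_t cap)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        size_t len = d[i].count * d[i].elem_size;
        assert(pos + len <= cap);
        memcpy(packed + pos, base + d[i].disp, len);
        pos += len;
    }
    return pos;
}

/* Scatter `len` contiguous bytes from `packed` into the regions of
 * `base` described by `d`; returns the number of bytes consumed. */
size_t unpack(const seg_t *d, size_t n, const unsigned char *packed,
              size_t len, unsigned char *base)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        size_t seg = d[i].count * d[i].elem_size;
        assert(pos + seg <= len);
        memcpy(base + d[i].disp, packed + pos, seg);
        pos += seg;
    }
    return pos;
}
```

Both descriptions add up to 32 bytes, which is the "size of data 32" reported on the two OPAL_END_LOOP lines, so a 32-byte packed buffer produced with `send_desc` can be consumed entirely with `recv_desc`.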
>> 
>>   George.
>> 
>> 
>> 
>> On Nov 25, 2013, at 11:40 , Nadia Derbey <nadia.der...@bull.net> wrote:
>> 
>>> Hi,
>>> 
>>> I'm currently working on a bug occurring at a client site with Open MPI 
>>> when calling MPI_Sendrecv() on datatypes built by the application.
>>> I think I've found where the bug comes from (it is located in 
>>> opal_generic_simple_pack_function(), in the file 
>>> opal/datatype/opal_datatype_pack.c). But this code is so complicated that 
>>> I'm far from sure of my fix. What I can say is that it fixes things for 
>>> me, but I need some advice from the datatype specialists.
>>> 
>>> ---------------
>>> 
>>> Attached you will find the reproducer provided by the client, as well 
>>> as the resulting output.
>>> datatypes.c : reproducer
>>> to run the binary: salloc --exclusive -p B510 -N 1 -n 1 mpirun ./datatypes
>>> trc_ko: traces obtained without the patch applied
>>> trc_ok: traces obtained with the patch applied.
>>> 
>>> ---------------
>>> 
>>> The proposed patch is the following: (Note that the very first change in 
>>> this patch was enough in my case, but I thought all the "source_base" 
>>> settings should follow this model.)
>>> 
>>> -------------------------
>>> opal_generic_simple_pack_function: add the datatype lb when progressing in 
>>> the input buffer
>>> 
>>> diff -r cb23c2f07e1f opal/datatype/opal_datatype_pack.c
>>> --- a/opal/datatype/opal_datatype_pack.c        Sun Nov 24 17:06:51 2013 +0000
>>> +++ b/opal/datatype/opal_datatype_pack.c        Mon Nov 25 10:48:00 2013 +0100
>>> @@ -301,7 +301,7 @@ opal_generic_simple_pack_function( opal_
>>>                  PACK_PREDEFINED_DATATYPE( pConvertor, pElem, count_desc,
>>>                                            source_base, destination, iov_len_local );
>>>                  if( 0 == count_desc ) {  /* completed */
>>> -                    source_base = pConvertor->pBaseBuf + pStack->disp;
>>> +                    source_base = pConvertor->pBaseBuf + pStack->disp + pData->lb;
>>>                      pos_desc++;  /* advance to the next data */
>>>                      UPDATE_INTERNAL_COUNTERS( description, pos_desc, pElem, count_desc );
>>>                      continue;
>>> @@ -333,7 +333,7 @@ opal_generic_simple_pack_function( opal_
>>>                          pStack->disp += description[pStack->index].loop.extent;
>>>                      }
>>>                  }
>>> -                source_base = pConvertor->pBaseBuf + pStack->disp;
>>> +                source_base = pConvertor->pBaseBuf + pStack->disp + pData->lb;
>>>                  UPDATE_INTERNAL_COUNTERS( description, pos_desc, pElem, count_desc );
>>>                  DO_DEBUG( opal_output( 0, "pack new_loop count %d stack_pos %d pos_desc %d disp %ld space %lu\n",
>>>                                         (int)pStack->count, pConvertor->stack_pos, pos_desc, (long)pStack->disp, (unsigned long)iov_len_local ); );
>>> @@ -354,7 +354,7 @@ opal_generic_simple_pack_function( opal_
>>>                              pStack->disp + local_disp);
>>>                  pos_desc++;
>>>              update_loop_description:  /* update the current state */
>>> -                source_base = pConvertor->pBaseBuf + pStack->disp;
>>> +                source_base = pConvertor->pBaseBuf + pStack->disp + pData->lb;
>>>                  UPDATE_INTERNAL_COUNTERS( description, pos_desc, pElem, count_desc );
>>>                  DDT_DUMP_STACK( pConvertor->pStack, pConvertor->stack_pos, pElem, "advance loop" );
>>>                  continue;
>>> @@ -374,7 +374,7 @@ opal_generic_simple_pack_function( opal_
>>>      }
>>>      /* I complete an element, next step I should go to the next one */
>>>      PUSH_STACK( pStack, pConvertor->stack_pos, pos_desc, OPAL_DATATYPE_INT8, count_desc,
>>> -                source_base - pStack->disp - pConvertor->pBaseBuf );
>>> +                source_base - pStack->disp - pConvertor->pBaseBuf - pData->lb );
>>>      DO_DEBUG( opal_output( 0, "pack save stack stack_pos %d pos_desc %d count_desc %d disp %ld\n",
>>>                             pConvertor->stack_pos, pStack->index, (int)pStack->count, (long)pStack->disp ); );
>>>      return 0;
>>> 
>>> -------------------------------
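
The pointer arithmetic in the proposed patch can be sketched in isolation. The snippet below is a hypothetical model only, not Open MPI code: `source_position` and `saved_offset` are made-up helpers that copy the patched expressions (`pBaseBuf + disp + lb` and `source_base - disp - pBaseBuf - lb`). It only shows why the last hunk has to mirror the first three once lb is added; it makes no claim about whether the real element displacements already account for lb.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the read position after the patched assignment
 * `source_base = pBaseBuf + disp + lb`, advanced by `partial` bytes
 * already consumed inside the current element. */
const unsigned char *source_position(const unsigned char *base_buf,
                                     ptrdiff_t disp, ptrdiff_t lb,
                                     ptrdiff_t partial)
{
    assert(base_buf != NULL);
    return base_buf + disp + lb + partial;
}

/* Hypothetical helper mirroring the patched PUSH_STACK argument
 * `source_base - disp - pBaseBuf - lb`: the offset left over once the
 * base, the displacement and the lower bound are removed. */
ptrdiff_t saved_offset(const unsigned char *source,
                       const unsigned char *base_buf,
                       ptrdiff_t disp, ptrdiff_t lb)
{
    assert(source != NULL && base_buf != NULL);
    return source - base_buf - disp - lb;
}
```

With this model, saving an offset and recomputing the position from it gives back the same pointer only when lb is handled the same way in both directions; subtracting without lb would fold the lower bound into the saved offset.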
>>> 
>>> Regards,
>>> Nadia
>>> -- 
>>> Nadia Derbey
>>> Bull, Architect of an Open World
>>> http://www.bull.com
>>> <datatypes.c><trc_ko.txt><trc_ok.txt>_______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> -- 
> Nadia Derbey
> Bull, Architect of an Open World
> http://www.bull.com
