Hi,
I'm currently working on a bug occuring at the client site with
openmpi when calling MPI_Sendreceive() on datatypes built by the
application.
I think I've found where the bug comes from (it is located in
opal_generic_simple_pack_function() - file
opal/datatype/opal_datatype_pack.c). But this code is so complicated
that I'm more than unsure of my fix. What I can say is that it fixes
things for me, but I need some advices from the datatypes specialists.
---------------
You will find in attachment the reproducer provided by the client, as
well as the resulting output.
datatypes.c : reproducer
to run the binary: salloc --exclusive -p B510 -N 1 -n 1 mpirun
./datatypes
trc_ko: traces got without the patch applied
trc_ok: traces got with the patch applied.
---------------
The proposed patch is the following: (Note that the very first change
in this patch was enough in my case, but I thought all the
"source_base" settings should follow this model.)
-------------------------
opal_generic_simple_pack_function: add the datatype lb when
progressing in the input buffer
diff -r cb23c2f07e1f opal/datatype/opal_datatype_pack.c
--- a/opal/datatype/opal_datatype_pack.c Sun Nov 24 17:06:51
2013 +0000
+++ b/opal/datatype/opal_datatype_pack.c Mon Nov 25 10:48:00
2013 +0100
@@ -301,7 +301,7 @@ opal_generic_simple_pack_function( opal_
PACK_PREDEFINED_DATATYPE( pConvertor, pElem, count_desc,
source_base, destination,
iov_len_local );
if( 0 == count_desc ) { /* completed */
- source_base = pConvertor->pBaseBuf + pStack->disp;
+ source_base = pConvertor->pBaseBuf +
pStack->disp + pData->lb;
pos_desc++; /* advance to the next data */
UPDATE_INTERNAL_COUNTERS( description, pos_desc,
pElem, count_desc );
continue;
@@ -333,7 +333,7 @@ opal_generic_simple_pack_function( opal_
pStack->disp +=
description[pStack->index].loop.extent;
}
}
- source_base = pConvertor->pBaseBuf + pStack->disp;
+ source_base = pConvertor->pBaseBuf + pStack->disp +
pData->lb;
UPDATE_INTERNAL_COUNTERS( description, pos_desc,
pElem, count_desc );
DO_DEBUG( opal_output( 0, "pack new_loop count %d
stack_pos %d pos_desc %d disp %ld space %lu\n",
(int)pStack->count, pConvertor->stack_pos, pos_desc,
(long)pStack->disp, (unsigned long)iov_len_local ); );
@@ -354,7 +354,7 @@ opal_generic_simple_pack_function( opal_
pStack->disp + local_disp);
pos_desc++;
update_loop_description: /* update the current state */
- source_base = pConvertor->pBaseBuf + pStack->disp;
+ source_base = pConvertor->pBaseBuf + pStack->disp +
pData->lb;
UPDATE_INTERNAL_COUNTERS( description, pos_desc,
pElem, count_desc );
DDT_DUMP_STACK( pConvertor->pStack,
pConvertor->stack_pos, pElem, "advance loop" );
continue;
@@ -374,7 +374,7 @@ opal_generic_simple_pack_function( opal_
}
/* I complete an element, next step I should go to the next one */
PUSH_STACK( pStack, pConvertor->stack_pos, pos_desc,
OPAL_DATATYPE_INT8, count_desc,
- source_base - pStack->disp - pConvertor->pBaseBuf );
+ source_base - pStack->disp - pConvertor->pBaseBuf -
pData->lb );
DO_DEBUG( opal_output( 0, "pack save stack stack_pos %d pos_desc
%d count_desc %d disp %ld\n",
pConvertor->stack_pos, pStack->index,
(int)pStack->count, (long)pStack->disp ); );
return 0;
-------------------------------
Regards,
Nadia
--
Nadia Derbey
Bull, Architect of an Open World
http://www.bull.com
<datatypes.c><trc_ko.txt><trc_ok.txt>_______________________________________________
devel mailing list
de...@open-mpi.org <mailto:de...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/devel