The problem is correctly identified and solved. I have already pushed the patch to the trunk. I will create the CMR for both 1.5 and 1.4.
Kudos to the Fujitsu team, that was a tricky one to find. Thanks for your contributions!

  george.

On Jan 12, 2012, at 10:39 , Barrett, Brian W wrote:

> George -
>
> This looks right to me, but the patches are in the datatype engine, so can
> you weigh in?
>
> Thanks,
>
> Brian
>
> On 1/11/12 10:04 PM, "Kawashima" <t-kawash...@jp.fujitsu.com> wrote:
>
>> Hi Open MPI developers,
>>
>> We, Fujitsu, noticed that one-sided communication with some sort of
>> derived datatype fails on sparc64 machines.
>>
>> In one-sided communication of Open MPI, the structure of the datatype
>> of the target buffer is:
>> (1) encoded in the origin process, and
>> (2) transferred to the target process, and
>> (3) decoded in the target process.
>>
>> This encoding and decoding are processed in ompi_datatype_args.c, which
>> takes the alignment issue into consideration. But it seems insufficient.
>>
>> On the encoding stage, the __ompi_datatype_pack_description function
>> takes the alignment issue into consideration, as described in its comment.
>> For derived datatypes of one level, that code is OK.
>> But for derived datatypes of multiple levels (i.e. derived datatypes
>> created from derived datatypes), the __ompi_datatype_pack_description
>> function is called recursively with an unaligned packed_buffer if
>> args->ci is odd.
>>
>> On the other hand, on the decoding stage, the
>> __ompi_datatype_create_from_packed_description function expects
>> a padding for odd args->ci. For derived datatypes, packed_buffer is
>> always aligned to 64 bits even if this function is called recursively.
>>
>> This incompatibility causes a segmentation fault or something
>> in the ompi_ddt_create_xxxx function called by the
>> __ompi_ddt_create_from_args function.
>>
>> It seems the decoding stage and the buffer size calculation (ALLOC_ARGS
>> macro) take the alignment issue sufficiently into consideration. So I
>> think fixing the encoding stage is sufficient for this bug.
>>
>> I've attached patches for trunk and v1.4 branch respectively.
>> A program (needs sparc64) to reproduce this problem is also attached.
>>
>> This bug appears if all of the following conditions are met.
>>
>> - sparc64 or some alignment sensitive architectures
>>   (configure generates OPAL_ALIGN_WORD_SIZE_INTEGERS == 1)
>> - use a derived datatype for the target buffer of one-sided communication
>> - create that derived datatype by multiple level MPI_Type_create_xxxx
>> - use one of the following functions in the second level or later
>>   (args->ci is odd)
>>   * MPI_Type_create_hvector
>>   * MPI_Type_create_struct
>>   * MPI_Type_create_hindexed
>>   * MPI_Type_create_indexed_block
>>
>>
>> Regards,
>>
>> Takahiro Kawashima,
>> MPI development team,
>> Fujitsu
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Brian W. Barrett
> Dept. 1423: Scalable System Software
> Sandia National Laboratories
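
For readers without the attached reproducer, the encode/decode disagreement described in the report can be illustrated with a toy model. The sketch below is not the Open MPI code and does not reflect the real packed layout in ompi_datatype_args.c. It assumes a hypothetical layout of ci 32-bit integers followed by one 64-bit value, and every toy_* name is invented for illustration. The decoder always expects padding up to a 64-bit boundary, while the encoder can be told to omit it, which mimics the mismatch for an odd ci. It copies bytes with memcpy, so it only shows the two sides disagreeing on the offset; it does not reproduce the unaligned access that faults on sparc64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_INTS 8   /* hypothetical limit, chosen so the buffer below fits */

/* Round a byte offset up to the next multiple of 8. */
size_t toy_align8(size_t off)
{
    return (off + 7u) & ~(size_t)7u;
}

/* Toy encoder (hypothetical layout): ci 32-bit ints, then one 64-bit value.
 * pad != 0: pad so that the 64-bit value starts on an 8-byte boundary.
 * pad == 0: omit the padding, as an encoder that forgets it would. */
void toy_pack(unsigned char *buf, int ci, int64_t disp, int pad)
{
    size_t off = 0;
    for (int i = 0; i < ci; i++) {
        int32_t v = i + 1;
        memcpy(buf + off, &v, sizeof v);
        off += sizeof v;
    }
    if (pad)
        off = toy_align8(off);
    memcpy(buf + off, &disp, sizeof disp);
}

/* Toy decoder: always assumes the padding is present. */
int64_t toy_unpack_disp(const unsigned char *buf, int ci)
{
    size_t off = toy_align8((size_t)ci * sizeof(int32_t));
    int64_t disp;
    memcpy(&disp, buf + off, sizeof disp);
    return disp;
}

/* Returns 1 if the decoder recovers the value the encoder wrote, else 0. */
int toy_roundtrip(int ci, int pad)
{
    unsigned char buf[64];
    const int64_t disp = INT64_C(0x1122334455667788);

    assert(ci >= 0 && ci <= TOY_MAX_INTS);
    memset(buf, 0, sizeof buf);
    toy_pack(buf, ci, disp, pad);
    return toy_unpack_disp(buf, ci) == disp;
}
```

In this toy model, toy_roundtrip(2, 0) and toy_roundtrip(3, 1) return 1, because an even ci needs no padding and a padded odd ci matches what the decoder expects. toy_roundtrip(3, 0) returns 0: the encoder writes the 64-bit value at offset 12 while the decoder reads it at offset 16.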