The problem is correctly identified and solved. I have already pushed the patch
to the trunk. I will create the CMRs for both 1.5 and 1.4.

Kudos to the Fujitsu team, that was a tricky one to find. Thanks for your
contributions!

  george.

On Jan 12, 2012, at 10:39 , Barrett, Brian W wrote:

> George -
> 
> This looks right to me, but the patches are in the datatype engine, so can
> you weigh in?
> 
> Thanks,
> 
> Brian
> 
> On 1/11/12 10:04 PM, "Kawashima" <t-kawash...@jp.fujitsu.com> wrote:
> 
>> Hi Open MPI developers,
>> 
>> We, Fujitsu, noticed that one-sided communication with certain
>> derived datatypes fails on sparc64 machines.
>> 
>> In one-sided communication of Open MPI, the structure of the target
>> buffer's datatype is:
>> (1) encoded in the origin process,
>> (2) transferred to the target process, and
>> (3) decoded in the target process.
>> 
>> This encoding and decoding is done in ompi_datatype_args.c, which
>> takes alignment into account, but apparently not sufficiently.
>> 
>> In the encoding stage, the __ompi_datatype_pack_description function
>> takes alignment into account, as described in its comment.
>> For single-level derived datatypes, that code is OK.
>> But for multi-level derived datatypes (i.e. derived datatypes
>> created from derived datatypes), __ompi_datatype_pack_description
>> is called recursively with an unaligned packed_buffer if
>> args->ci is odd.
>> 
>> In the decoding stage, on the other hand, the
>> __ompi_datatype_create_from_packed_description function expects
>> padding for an odd args->ci. For derived datatypes, packed_buffer is
>> always aligned to 64 bits, even when this function is called recursively.
>> 
>> This mismatch causes a segmentation fault or a similar failure
>> in the ompi_ddt_create_xxxx function called by the
>> __ompi_ddt_create_from_args function.
>> 
>> The decoding stage and the buffer size calculation (ALLOC_ARGS macro)
>> seem to handle alignment adequately, so I think fixing the encoding
>> stage is sufficient for this bug.
>> 
>> I've attached patches for the trunk and the v1.4 branch, respectively.
>> A program (requires sparc64) to reproduce this problem is also attached.
>> 
>> This bug appears if all of the following conditions are met.
>> 
>> - sparc64 or another alignment-sensitive architecture
>>   (configure generates OPAL_ALIGN_WORD_SIZE_INTEGERS == 1)
>> - a derived datatype is used for the target buffer of one-sided
>>   communication
>> - that derived datatype is created by multiple levels of
>>   MPI_Type_create_xxxx
>> - one of the following functions is used at the second level or later
>>   (args->ci is odd)
>>     * MPI_Type_create_hvector
>>     * MPI_Type_create_struct
>>     * MPI_Type_create_hindexed
>>     * MPI_Type_create_indexed_block
>> 
>> 
>> Regards,
>> 
>> Takahiro Kawashima,
>> MPI development team,
>> Fujitsu
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
>  Brian W. Barrett
>  Dept. 1423: Scalable System Software
>  Sandia National Laboratories

