I fixed the problem we had with BLACS. Since it looks like everybody believes it was a data-type issue, I fixed it in the DDT engine. However, as I explained this morning on the phone conference (and nobody believed it), the problem was triggered by the way the convertor was used. For me it's an easy fix at the DDT layer, and it will allow BTL developers to pay less attention to the way they pack/unpack data ... but it is not the way the DDT was designed.

Here is the explanation of what went wrong:
BLACS creates a triangular matrix using an indexed type. The memory layout of this data-type consists of several contiguous buffers with gaps in between. The problem we had was the following:
1. On the sender side, pack was called with a buffer large enough to hold all the data.
2. On the receiver side, unpack was called twice with different iovecs.
Even though the total length of the 2 iovecs was correct, the first one was too short, which made the convertor stop in the middle of a basic type. And that was not the way the convertor was designed to work.

Here is the output of the DDT engine for SM.

First the pack side:

[applebasket.cs.utk.edu:16760] ompi_convertor_generic_simple_pack ( 0xbfffc104, {0x2811430, 4560}, 1 )
[applebasket.cs.utk.edu:16760] unpack start pos_desc 0 count_desc 6 disp 0
stack_pos 0 pos_desc -1 count_desc 1 disp 0
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2811430, 0xac650, 96 ) => space 4560
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2811490, 0xac7e0, 112 ) => space 4464
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2811500, 0xac970, 128 ) => space 4352
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2811580, 0xacb00, 144 ) => space 4224
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2811610, 0xacc90, 160 ) => space 4080
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x28116b0, 0xace20, 176 ) => space 3920
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2811760, 0xacfb0, 192 ) => space 3744
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2811820, 0xad140, 208 ) => space 3552
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x28118f0, 0xad2d0, 224 ) => space 3344
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x28119d0, 0xad460, 240 ) => space 3120
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2811ac0, 0xad5f0, 256 ) => space 2880
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2811bc0, 0xad780, 272 ) => space 2624
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2811cd0, 0xad910, 288 ) => space 2352
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2811df0, 0xadaa0, 304 ) => space 2064
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2811f20, 0xadc30, 320 ) => space 1760
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2812060, 0xaddc0, 336 ) => space 1440
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x28121b0, 0xadf50, 352 ) => space 1104
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2812310, 0xae0e0, 368 ) => space 752
[applebasket.cs.utk.edu:16760] pack 1. memcpy( 0x2812480, 0xae270, 384 ) => space 384
[applebasket.cs.utk.edu:16760] pack end_loop count 1 stack_pos 0 pos_desc 19 disp 0 space 0

As you can see, there is one pack operation with a buffer of 4560 bytes ... exactly the size of the whole data. Even though the pack takes care not to cut a basic type in the middle, in this particular case it has enough space to do its job correctly.
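For readers who want to check the 4560 figure, here is a minimal sketch that rebuilds the layout from the numbers in the pack trace: 19 contiguous blocks that grow by one 16-byte element each, starting at 96 bytes, with consecutive blocks starting 400 bytes apart in user memory (the distance between consecutive source addresses in the trace). The constant and function names are invented for illustration; they are not part of the DDT engine.

```c
#include <stddef.h>

/* Values read off the pack trace; the names are hypothetical. */
enum {
    ELEM_SIZE    = 16,  /* one long double on that machine          */
    FIRST_ELEMS  = 6,   /* first block: 96 bytes = 6 elements       */
    NUM_BLOCKS   = 19,  /* number of memcpy calls in the pack trace */
    BLOCK_STRIDE = 400  /* 0xac7e0 - 0xac650, in bytes              */
};

/* Length in bytes of contiguous block i: 96, 112, ..., 384. */
size_t block_bytes(int i)
{
    return (size_t)(FIRST_ELEMS + i) * ELEM_SIZE;
}

/* Displacement of block i from the start of the user buffer. */
size_t block_disp(int i)
{
    return (size_t)i * BLOCK_STRIDE;
}

/* Size of the packed (gap-free) representation of the whole type. */
size_t total_packed_bytes(void)
{
    size_t total = 0;
    for (int i = 0; i < NUM_BLOCKS; i++)
        total += block_bytes(i);
    return total;
}
```

Under these assumptions total_packed_bytes() comes out to 4560, the size of the single pack buffer in the trace.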

The receiver side looks a little bit different:

[applebasket.cs.utk.edu:16758] ompi_convertor_generic_simple_unpack ( 0x280bf04, {0x229e15c, 956}, 1 )
[applebasket.cs.utk.edu:16758] unpack start pos_desc 0 count_desc 6 disp 0
stack_pos 0 pos_desc -1 count_desc 1 disp 0
[applebasket.cs.utk.edu:16758] unpack 1. memcpy( 0xac650, 0x229e15c, 96 ) => space 956
[applebasket.cs.utk.edu:16758] unpack 1. memcpy( 0xac7e0, 0x229e1bc, 112 ) => space 860
[applebasket.cs.utk.edu:16758] unpack 1. memcpy( 0xac970, 0x229e22c, 128 ) => space 748
[applebasket.cs.utk.edu:16758] unpack 1. memcpy( 0xacb00, 0x229e2ac, 144 ) => space 620
[applebasket.cs.utk.edu:16758] unpack 1. memcpy( 0xacc90, 0x229e33c, 160 ) => space 476
[applebasket.cs.utk.edu:16758] unpack 1. memcpy( 0xace20, 0x229e3dc, 176 ) => space 316
[applebasket.cs.utk.edu:16758] unpack 1. memcpy( 0xacfb0, 0x229e48c, 128 ) => space 140
[applebasket.cs.utk.edu:16758] Losing 12 bytes !!!
[applebasket.cs.utk.edu:16758] unpack save stack stack_pos 1 pos_desc 6 count_desc 4 disp 128
[applebasket.cs.utk.edu:16758] ompi_convertor_generic_simple_unpack ( 0x280bf04, {0x229e158, 3604}, 1 )
[applebasket.cs.utk.edu:16758] unpack start pos_desc 6 count_desc 4 disp 128
stack_pos 0 pos_desc -1 count_desc 1 disp 0
[applebasket.cs.utk.edu:16758] unpack pending from the last unpack 12 out of 16 bytes
[applebasket.cs.utk.edu:16758] unpack 1. memcpy( 0xad030, 0x280bf4c, 16 ) => space 16
... (skipped)

We can see the trace of 2 unpack operations, one with a size of 956 bytes and the other with 3604. In the middle of the trace above you can notice the "Losing 12 bytes !!!" message. The basic type here is a long double (16 bytes on this machine), so we definitely stopped in the middle of a basic type.
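The 12 bytes can be reproduced with a little arithmetic. The sketch below walks blocks of 6, 7, 8, ... elements (the block lengths seen in the trace), consumes a given number of input bytes while copying whole elements only, and reports where it has to stop. The struct and function names are hypothetical; this only mimics the bookkeeping and is not the convertor code.

```c
#include <stddef.h>

/* Where an unpack call runs out of input. Names are illustrative only. */
struct stop_point {
    int    block;        /* index of the block in which the input ran out  */
    size_t whole_bytes;  /* bytes of that block covered by whole elements  */
    size_t leftover;     /* trailing bytes covering only part of an element */
};

/* Consume `avail` bytes of packed input over blocks of first_elems,
 * first_elems + 1, ... elements, never splitting an element. */
struct stop_point find_stop(size_t avail, size_t elem_size,
                            size_t first_elems, int num_blocks)
{
    struct stop_point sp = { num_blocks, 0, 0 };
    for (int i = 0; i < num_blocks; i++) {
        size_t len = (first_elems + (size_t)i) * elem_size;
        if (avail >= len) {      /* the whole block fits in the input */
            avail -= len;
            continue;
        }
        sp.block = i;
        sp.whole_bytes = (avail / elem_size) * elem_size;
        sp.leftover = avail % elem_size;
        return sp;
    }
    return sp;                   /* input covered every block */
}
```

With the values from the trace, find_stop(956, 16, 6, 19) stops in block 6 after 128 bytes of whole elements with 12 bytes left over: the first six blocks take 816 bytes, the remaining 140 bytes hold 8 complete long doubles, and the last 12 bytes are three quarters of the ninth.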

Correct usage of the convertor would prevent such problems. Anyway, the convertor will now remember this kind of error and automatically correct it (the cost is just one if in the critical path and some extra memory in the convertor struct).
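To illustrate the idea (and only the idea), here is a minimal sketch of an unpacker that stashes the bytes of an incomplete element and finishes it on the next call. It is not the actual convertor code: the struct, the function names and MAX_ELEM are all made up, and the destination is assumed contiguous, whereas the real engine also has to deal with the gaps of the data-type.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ELEM 32  /* assumed upper bound on a basic type's size */

/* Hypothetical resumable unpacker for fixed-size elements. */
struct unpacker {
    unsigned char *dst;               /* next element slot in user memory */
    size_t         elem_size;         /* must be > 0 and <= MAX_ELEM      */
    unsigned char  pending[MAX_ELEM]; /* bytes of an incomplete element   */
    size_t         pending_len;
};

void unpacker_init(struct unpacker *u, void *dst, size_t elem_size)
{
    u->dst = dst;
    u->elem_size = elem_size;
    u->pending_len = 0;
}

/* Consume len bytes of packed input; returns the number of complete
 * elements written to user memory by this call. */
size_t unpacker_feed(struct unpacker *u, const void *src, size_t len)
{
    const unsigned char *in = src;
    size_t done = 0;

    /* The extra "if": complete the element cut by the previous call. */
    if (u->pending_len > 0) {
        size_t need = u->elem_size - u->pending_len;
        size_t take = len < need ? len : need;
        if (take > 0)
            memcpy(u->pending + u->pending_len, in, take);
        u->pending_len += take;
        in += take;
        len -= take;
        if (u->pending_len < u->elem_size)
            return 0;                 /* still incomplete, keep waiting */
        memcpy(u->dst, u->pending, u->elem_size);
        u->dst += u->elem_size;
        u->pending_len = 0;
        done = 1;
    }

    /* Copy as many whole elements as the input holds. */
    size_t whole = (len / u->elem_size) * u->elem_size;
    if (whole > 0) {
        memcpy(u->dst, in, whole);
        u->dst += whole;
        in += whole;
        len -= whole;
        done += whole / u->elem_size;
    }

    /* Remember the tail instead of losing it. */
    if (len > 0)
        memcpy(u->pending, in, len);
    u->pending_len = len;
    return done;
}
```

Feeding such an unpacker 28 bytes and then 36 bytes of a 64-byte stream of 16-byte elements gives 1 and then 3 complete elements, and the 12 bytes left over by the first call are used rather than dropped.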

  george.

"Half of what I say is meaningless; but I say it so that the other half may reach you"
                                  Kahlil Gibran

