I went over the provided trace file and tried to force the BTLs to handle extremely weird (and uncomfortable) lengths, on both Mac OS X and 64-bit Linux. Despite all my efforts I was unable to reproduce this error, so I'm giving up until more information becomes available.
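For the record, my runs were along these lines (the fragment sizes here are illustrative, not magic values; the point is to pick awkward, non-power-of-two limits so the convertor has to split the packed stream at odd offsets, which is where the r31370 rewrite would be most likely to misbehave):

    mpirun --mca btl self,openib -np 2 \
        --mca btl_openib_eager_limit 1237 \
        --mca btl_openib_max_send_size 4021 \
        --mca mpi_ddt_pack_debug 1 --mca mpi_ddt_unpack_debug 1 \
        MPI_Isend_ator_c

The two mpi_ddt_* parameters turn on the pack/unpack tracing I asked Rolf about below.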
  George.

On Thu, Apr 17, 2014 at 11:28 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> I sent this information to George off the mailing list since the attachment
> was somewhat large.
> Still, it is strange that I seem to be the only one who sees this.
>
>> -----Original Message-----
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of George Bosilca
>> Sent: Wednesday, April 16, 2014 4:24 PM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] Possible bug with derived datatypes and openib BTL in trunk
>>
>> Rolf,
>>
>> I didn't see these on my check run. Can you run the MPI_Isend_ator test with
>> mpi_ddt_pack_debug and mpi_ddt_unpack_debug set to 1? I would be
>> interested in the output you get on your machine.
>>
>> George.
>>
>>
>> On Apr 16, 2014, at 14:34, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>>
>>> I have seen errors when running the Intel test suite using the openib BTL
>>> when transferring derived datatypes. I do not see the errors with the sm
>>> or tcp BTLs. The errors begin after this checkin:
>>>
>>> https://svn.open-mpi.org/trac/ompi/changeset/31370
>>> Timestamp: 04/11/14 16:06:56 (5 days ago)
>>> Author: bosilca
>>> Message: Reshape all the packing/unpacking functions to use the same
>>> skeleton. Rewrite the generic_unpacking to take advantage of the same
>>> capabilities.
>>>
>>> Does anyone else see errors? Here is an example running with r31370:
>>>
>>> [rvandevaart@drossetti-ivy1 src]$ mpirun --mca btl self,openib -np 2 -host drossetti-ivy0,drossetti-ivy1 --mca btl_openib_warn_default_gid_prefix 0 MPI_Isend_ator_c
>>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>>> MPITEST error (1): 2 errors in buffer (17,0,12) len 273 commsize 2 commtype -10 data_type 13 root 1
>>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>>> MPITEST error (1): 2 errors in buffer (17,2,12) len 273 commsize 2 commtype -16 data_type 13 root 1
>>> MPITEST info (0): Starting MPI_Isend_ator: All Isend TO Root test
>>> MPITEST info (0): Node spec MPITEST_comm_sizes[6]=2 too large, using 1
>>> MPITEST info (0): Node spec MPITEST_comm_sizes[22]=2 too large, using 1
>>> MPITEST info (0): Node spec MPITEST_comm_sizes[32]=2 too large, using 1
>>> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 118
>>> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -60
>>> MPITEST error (0): 2 errors in buffer (17,0,12) len 273 commsize 2 commtype -10 data_type 13 root 0
>>> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 118
>>> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -60
>>> MPITEST error (0): 2 errors in buffer (17,2,12) len 273 commsize 2 commtype -16 data_type 13 root 0
>>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>>> MPITEST error (1): 2 errors in buffer (17,4,12) len 273 commsize 2 commtype -13 data_type 13 root 1
>>> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 118
>>> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -60
>>> MPITEST error (0): 2 errors in buffer (17,4,12) len 273 commsize 2 commtype -13 data_type 13 root 0
>>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>>> MPITEST error (1): 2 errors in buffer (17,6,12) len 273 commsize 2 commtype -15 data_type 13 root 0
>>> MPITEST error (0): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>>> MPITEST error (0): libmpitest.c:1578 i=195, char value=-1, expected -61
>>> MPITEST error (0): 2 errors in buffer (17,6,12) len 273 commsize 2 commtype -15 data_type 13 root 0
>>> MPITEST_results: MPI_Isend_ator: All Isend TO Root 8 tests FAILED (of 3744)
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned a non-zero
>>> exit code. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun detected that one or more processes exited with non-zero
>>> status, thus causing the job to be terminated. The first process to do
>>> so was:
>>>
>>>   Process name: [[12363,1],0]
>>>   Exit code: 4
>>> --------------------------------------------------------------------------
>>> [rvandevaart@drossetti-ivy1 src]$
>>>
>>>
>>> Here is an error with the trunk which is slightly different:
>>>
>>> [rvandevaart@drossetti-ivy1 src]$ mpirun --mca btl self,openib -np 2 -host drossetti-ivy0,drossetti-ivy1 --mca btl_openib_warn_default_gid_prefix 0 MPI_Isend_ator_c
>>> [drossetti-ivy1.nvidia.com:22875] ../../../opal/datatype/opal_datatype_position.c:72
>>>         Pointer 0x1ad414c size 4 is outside [0x1ac1d20,0x1ad1d08] for base ptr 0x1ac1d20 count 273 and data
>>> [drossetti-ivy1.nvidia.com:22875] Datatype 0x1ac0220[] size 104 align 16 id 0 length 22 used 21
>>>         true_lb 0 true_ub 232 (true_extent 232) lb 0 ub 240 (extent 240)
>>>         nbElems 21 loops 0 flags 1C4 (commited )-c--lu-GD--[---][---]
>>>    contain lb ub OPAL_LB OPAL_UB OPAL_INT1 OPAL_INT2 OPAL_INT4 OPAL_INT8 OPAL_UINT1 OPAL_UINT2 OPAL_UINT4 OPAL_UINT8 OPAL_FLOAT4 OPAL_FLOAT8 OPAL_FLOAT16
>>> --C---P-D--[---][---]  OPAL_INT4 count 1 disp 0x0 (0) extent 4 (size 4)
>>> --C---P-D--[---][---]  OPAL_INT2 count 1 disp 0x8 (8) extent 2 (size 2)
>>> --C---P-D--[---][---]  OPAL_INT8 count 1 disp 0x10 (16) extent 8 (size 8)
>>> --C---P-D--[---][---]  OPAL_UINT2 count 1 disp 0x20 (32) extent 2 (size 2)
>>> --C---P-D--[---][---]  OPAL_UINT4 count 1 disp 0x24 (36) extent 4 (size 4)
>>> --C---P-D--[---][---]  OPAL_UINT8 count 1 disp 0x30 (48) extent 8 (size 8)
>>> --C---P-D--[---][---]  OPAL_FLOAT4 count 1 disp 0x40 (64) extent 4 (size 4)
>>> --C---P-D--[---][---]  OPAL_INT1 count 1 disp 0x48 (72) extent 1 (size 1)
>>> --C---P-D--[---][---]  OPAL_FLOAT8 count 1 disp 0x50 (80) extent 8 (size 8)
>>> --C---P-D--[---][---]  OPAL_UINT1 count 1 disp 0x60 (96) extent 1 (size 1)
>>> --C---P-D--[---][---]  OPAL_FLOAT16 count 1 disp 0x70 (112) extent 16 (size 16)
>>> --C---P-D--[---][---]  OPAL_INT1 count 1 disp 0x90 (144) extent 1 (size 1)
>>> --C---P-D--[---][---]  OPAL_UINT1 count 1 disp 0x92 (146) extent 1 (size 1)
>>> --C---P-D--[---][---]  OPAL_INT2 count 1 disp 0x94 (148) extent 2 (size 2)
>>> --C---P-D--[---][---]  OPAL_UINT2 count 1 disp 0x98 (152) extent 2 (size 2)
>>> --C---P-D--[---][---]  OPAL_INT4 count 1 disp 0x9c (156) extent 4 (size 4)
>>> --C---P-D--[---][---]  OPAL_UINT4 count 1 disp 0xa4 (164) extent 4 (size 4)
>>> --C---P-D--[---][---]  OPAL_INT8 count 1 disp 0xb0 (176) extent 8 (size 8)
>>> --C---P-D--[---][---]  OPAL_UINT8 count 1 disp 0xc0 (192) extent 8 (size 8)
>>> --C---P-D--[---][---]  OPAL_INT8 count 1 disp 0xd0 (208) extent 8 (size 8)
>>> --C---P-D--[---][---]  OPAL_UINT8 count 1 disp 0xe0 (224) extent 8 (size 8)
>>> -------G---[---][---]  OPAL_END_LOOP prev 21 elements first elem displacement 0 size of data 104
>>> Optimized description
>>> -cC---P-DB-[---][---]  OPAL_INT4 count 1 disp 0x0 (0) extent 4 (size 4)
>>> -cC---P-DB-[---][---]  OPAL_INT2 count 1 disp 0x8 (8) extent 2 (size 2)
>>> -cC---P-DB-[---][---]  OPAL_INT8 count 1 disp 0x10 (16) extent 8 (size 8)
>>> -cC---P-DB-[---][---]  OPAL_UINT2 count 1 disp 0x20 (32) extent 2 (size 2)
>>> -cC---P-DB-[---][---]  OPAL_UINT4 count 1 disp 0x24 (36) extent 4 (size 4)
>>> -cC---P-DB-[---][---]  OPAL_UINT8 count 1 disp 0x30 (48) extent 8 (size 8)
>>> -cC---P-DB-[---][---]  OPAL_FLOAT4 count 1 disp 0x40 (64) extent 4 (size 4)
>>> -cC---P-DB-[---][---]  OPAL_INT1 count 1 disp 0x48 (72) extent 1 (size 1)
>>> -cC---P-DB-[---][---]  OPAL_FLOAT8 count 1 disp 0x50 (80) extent 8 (size 8)
>>> -cC---P-DB-[---][---]  OPAL_UINT1 count 1 disp 0x60 (96) extent 1 (size 1)
>>> -cC---P-DB-[---][---]  OPAL_FLOAT16 count 1 disp 0x70 (112) extent 16 (size 16)
>>> -cC---P-DB-[---][---]  OPAL_INT1 count 1 disp 0x90 (144) extent 1 (size 1)
>>> -cC---P-DB-[---][---]  OPAL_UINT1 count 1 disp 0x92 (146) extent 1 (size 1)
>>> -cC---P-DB-[---][---]  OPAL_INT2 count 1 disp 0x94 (148) extent 2 (size 2)
>>> -cC---P-DB-[---][---]  OPAL_UINT2 count 1 disp 0x98 (152) extent 2 (size 2)
>>> -cC---P-DB-[---][---]  OPAL_INT4 count 1 disp 0x9c (156) extent 4 (size 4)
>>> -cC---P-DB-[---][---]  OPAL_UINT4 count 1 disp 0xa4 (164) extent 4 (size 4)
>>> -cC---P-DB-[---][---]  OPAL_INT8 count 1 disp 0xb0 (176) extent 8 (size 8)
>>> -cC---P-DB-[---][---]  OPAL_UINT8 count 1 disp 0xc0 (192) extent 8 (size 8)
>>> -cC---P-DB-[---][---]  OPAL_INT8 count 1 disp 0xd0 (208) extent 8 (size 8)
>>> -cC---P-DB-[---][---]  OPAL_UINT8 count 1 disp 0xe0 (224) extent 8 (size 8)
>>> -------G---[---][---]  OPAL_END_LOOP prev 21 elements first elem displacement 0 size of data 104
>>>
>>> MPITEST error (1): libmpitest.c:1578 i=0, char value=-61, expected 0
>>> MPITEST error (1): libmpitest.c:1608 i=0, int32_t value=117, expected 0
>>> MPITEST error (1): libmpitest.c:1608 i=117, int32_t value=-1, expected 117
>>> MPITEST error (1): libmpitest.c:1578 i=195, char value=-1, expected -61
>>> MPITEST error (1): 4 errors in buffer (17,0,12) len 273 commsize 2 commtype -10 data_type 13 root 1
>>> MPITEST info (0): Starting MPI_Isend_ator: All Isend TO Root test
>>> MPITEST info (0): Node spec MPITEST_comm_sizes[6]=2 too large, using 1
>>> MPITEST info (0): Node spec MPITEST_comm_sizes[22]=2 too large, using 1
>>> MPITEST info (0): Node spec MPITEST_comm_sizes[32]=2 too large, using 1
>>> MPITEST_results: MPI_Isend_ator: All Isend TO Root 1 tests FAILED (of 3744)
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned a non-zero
>>> exit code. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun detected that one or more processes exited with non-zero
>>> status, thus causing the job to be terminated. The first process to do
>>> so was:
The first process to do >>> so was: >>> >>> Process name: [[12296,1],1] >>> Exit code: 1 >>> ---------------------------------------------------------------------- >>> ---- >>> [rvandevaart@drossetti-ivy1 src]$ >>> >>> ---------------------------------------------------------------------- >>> ------------- This email message is for the sole use of the intended >>> recipient(s) and may contain confidential information. Any >>> unauthorized review, use, disclosure or distribution is prohibited. >>> If you are not the intended recipient, please contact the sender by >>> reply email and destroy all copies of the original message. >>> ---------------------------------------------------------------------- >>> ------------- _______________________________________________ >>> devel mailing list >>> de...@open-mpi.org >>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> Link to this post: >>> http://www.open-mpi.org/community/lists/devel/2014/04/14553.php >> >>_______________________________________________ >>devel mailing list >>de...@open-mpi.org >>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel >>Link to this post: http://www.open- >>mpi.org/community/lists/devel/2014/04/14554.php > _______________________________________________ > devel mailing list > de...@open-mpi.org > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel > Link to this post: > http://www.open-mpi.org/community/lists/devel/2014/04/14559.php