Hi,

One of our users is reporting an issue using MPI_Allgatherv with a large derived datatype: it segfaults inside OpenMPI. Using a debug build of OpenMPI 3.1.2 produces a ton of messages like this before the segfault:
[r3816:50921] ../../../../../opal/datatype/opal_datatype_pack.h:53
    Pointer 0x2acd0121b010 size 131040 is outside [0x2ac5ed268010,0x2ac980ad8010]
    for base ptr 0x2ac5ed268010 count 1 and data
[r3816:50921] Datatype 0x42998b0[] size 5920000000 align 4 id 0 length 7 used 6
    true_lb 0 true_ub 15360000000 (true_extent 15360000000)
    lb 0 ub 15360000000 (extent 15360000000)
    nbElems 1480000000 loops 4 flags 104 (committed )
-c-----GD--[---][---]  contain OPAL_FLOAT4:*
--C--------[---][---]  OPAL_LOOP_S 4 times the next 2 elements extent 80000000
--C---P-D--[---][---]  OPAL_FLOAT4 count 20000000 disp 0x380743000 (15040000000) blen 0 extent 4 (size 80000000)
--C--------[---][---]  OPAL_LOOP_E prev 2 elements first elem displacement 15040000000 size of data 80000000
--C--------[---][---]  OPAL_LOOP_S 70 times the next 2 elements extent 80000000
--C---P-D--[---][---]  OPAL_FLOAT4 count 20000000 disp 0x0 (0) blen 0 extent 4 (size 80000000)
--C--------[---][---]  OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 80000000
-------G---[---][---]  OPAL_LOOP_E prev 6 elements first elem displacement 15040000000 size of data 1625032704
Optimized description
-cC---P-DB-[---][---]  OPAL_UINT1 count 320000000 disp 0x380743000 (15040000000) blen 1 extent 1 (size 320000000)
-cC---P-DB-[---][---]  OPAL_UINT1 count 1305032704 disp 0x0 (0) blen 1 extent 1 (size 5600000000)
-------G---[---][---]  OPAL_LOOP_E prev 2 elements first elem displacement 15040000000 size of d

Here is the backtrace:

==== backtrace ====
 0 0x000000000008987b memcpy()  ???:0
 1 0x00000000000639b6 opal_cuda_memcpy()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/datatype/../../../../../opal/datatype/opal_datatype_cuda.c:99
 2 0x000000000005cd7a pack_predefined_data()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/datatype/../../../../../opal/datatype/opal_datatype_pack.h:56
 3 0x000000000005e845 opal_generic_simple_pack()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/datatype/../../../../../opal/datatype/opal_datatype_pack.c:319
 4 0x000000000004ce6e opal_convertor_pack()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/datatype/../../../../../opal/datatype/opal_convertor.c:272
 5 0x000000000000e3b6 mca_btl_openib_prepare_src()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib.c:1609
 6 0x0000000000023c75 mca_bml_base_prepare_src()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/bml/bml.h:341
 7 0x0000000000027d2a mca_pml_ob1_send_request_schedule_once()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_sendreq.c:995
 8 0x000000000002473c mca_pml_ob1_send_request_schedule_exclusive()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_sendreq.h:313
 9 0x000000000002479d mca_pml_ob1_send_request_schedule()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_sendreq.h:337
10 0x00000000000256fe mca_pml_ob1_frag_completion()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_sendreq.c:321
11 0x000000000001baaf handle_wc()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib_component.c:3565
12 0x000000000001c20c poll_device()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib_component.c:3719
13 0x000000000001c6c0 progress_one_device()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib_component.c:3829
14 0x000000000001c763 btl_openib_component_progress()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib_component.c:3853
15 0x000000000002ff90 opal_progress()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/../../../../opal/runtime/opal_progress.c:228
16 0x000000000001114c ompi_request_wait_completion()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/request/request.h:413
17 0x0000000000013a80 mca_pml_ob1_send()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_isend.c:266
18 0x000000000010ca45 ompi_coll_base_sendrecv_actual()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_util.c:55
19 0x000000000010b5bc ompi_coll_base_sendrecv()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_util.h:67
20 0x000000000010ba1e ompi_coll_base_allgatherv_intra_bruck()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_allgatherv.c:184
21 0x0000000000005ac5 ompi_coll_tuned_allgatherv_intra_dec_fixed()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/coll/tuned/../../../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:640
22 0x000000000007c40d PMPI_Allgatherv()  /short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mpi/c/profile/pallgatherv.c:143
23 0x0000000000401e25 main()  /short/z00/bjm900/help/pxs599/memtest.2/memtest1.c:182
24 0x000000000001ed20 __libc_start_main()  ???:0
25 0x00000000004012b9 _start()  ???:0
===================

The derived datatype is produced using

    MPI_Type_contiguous(P, MPI_FLOAT, &mpitype_vec_nobs)

where P = 20000000 (so quite large). Is there any restriction on the maximum size a datatype can be? Or, perhaps, on the extent a message can cover, since the Allgatherv creates its own internal datatypes?

Thanks,
Ben
_______________________________________________ devel mailing list devel@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/devel