Hi,

One of our users is reporting a segfault inside Open MPI when calling MPI_Allgatherv 
with a large derived datatype. A debug build of Open MPI 3.1.2 prints a large number 
of messages like the following before the segfault:

[r3816:50921] ../../../../../opal/datatype/opal_datatype_pack.h:53
        Pointer 0x2acd0121b010 size 131040 is outside [0x2ac5ed268010,0x2ac980ad8010] for
        base ptr 0x2ac5ed268010 count 1 and data 
[r3816:50921] Datatype 0x42998b0[] size 5920000000 align 4 id 0 length 7 used 6
true_lb 0 true_ub 15360000000 (true_extent 15360000000) lb 0 ub 15360000000 (extent 15360000000)
nbElems 1480000000 loops 4 flags 104 (committed )-c-----GD--[---][---]
   contain OPAL_FLOAT4:* 
--C--------[---][---]    OPAL_LOOP_S 4 times the next 2 elements extent 80000000
--C---P-D--[---][---]    OPAL_FLOAT4 count 20000000 disp 0x380743000 (15040000000) blen 0 extent 4 (size 80000000)
--C--------[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 15040000000 size of data 80000000
--C--------[---][---]    OPAL_LOOP_S 70 times the next 2 elements extent 80000000
--C---P-D--[---][---]    OPAL_FLOAT4 count 20000000 disp 0x0 (0) blen 0 extent 4 (size 80000000)
--C--------[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 80000000
-------G---[---][---]    OPAL_LOOP_E prev 6 elements first elem displacement 15040000000 size of data 1625032704
Optimized description 
-cC---P-DB-[---][---]     OPAL_UINT1 count 320000000 disp 0x380743000 (15040000000) blen 1 extent 1 (size 320000000)
-cC---P-DB-[---][---]     OPAL_UINT1 count 1305032704 disp 0x0 (0) blen 1 extent 1 (size 5600000000)
-------G---[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 15040000000 size of d

Here is the backtrace:

==== backtrace ====
 0 0x000000000008987b memcpy()  ???:0
 1 0x00000000000639b6 opal_cuda_memcpy()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/datatype/../../../../../opal/datatype/opal_datatype_cuda.c:99
 2 0x000000000005cd7a pack_predefined_data()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/datatype/../../../../../opal/datatype/opal_datatype_pack.h:56
 3 0x000000000005e845 opal_generic_simple_pack()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/datatype/../../../../../opal/datatype/opal_datatype_pack.c:319
 4 0x000000000004ce6e opal_convertor_pack()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/datatype/../../../../../opal/datatype/opal_convertor.c:272
 5 0x000000000000e3b6 mca_btl_openib_prepare_src()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib.c:1609
 6 0x0000000000023c75 mca_bml_base_prepare_src()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/bml/bml.h:341
 7 0x0000000000027d2a mca_pml_ob1_send_request_schedule_once()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_sendreq.c:995
 8 0x000000000002473c mca_pml_ob1_send_request_schedule_exclusive()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_sendreq.h:313
 9 0x000000000002479d mca_pml_ob1_send_request_schedule()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_sendreq.h:337
10 0x00000000000256fe mca_pml_ob1_frag_completion()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_sendreq.c:321
11 0x000000000001baaf handle_wc()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib_component.c:3565
12 0x000000000001c20c poll_device()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib_component.c:3719
13 0x000000000001c6c0 progress_one_device()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib_component.c:3829
14 0x000000000001c763 btl_openib_component_progress()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib_component.c:3853
15 0x000000000002ff90 opal_progress()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/../../../../opal/runtime/opal_progress.c:228
16 0x000000000001114c ompi_request_wait_completion()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/request/request.h:413
17 0x0000000000013a80 mca_pml_ob1_send()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_isend.c:266
18 0x000000000010ca45 ompi_coll_base_sendrecv_actual()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_util.c:55
19 0x000000000010b5bc ompi_coll_base_sendrecv()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_util.h:67
20 0x000000000010ba1e ompi_coll_base_allgatherv_intra_bruck()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/coll/../../../../../../ompi/mca/coll/base/coll_base_allgatherv.c:184
21 0x0000000000005ac5 ompi_coll_tuned_allgatherv_intra_dec_fixed()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/coll/tuned/../../../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:640
22 0x000000000007c40d PMPI_Allgatherv()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mpi/c/profile/pallgatherv.c:143
23 0x0000000000401e25 main()  
/short/z00/bjm900/help/pxs599/memtest.2/memtest1.c:182
24 0x000000000001ed20 __libc_start_main()  ???:0
25 0x00000000004012b9 _start()  ???:0
===================

The derived datatype is produced using
        MPI_Type_contiguous(P, MPI_FLOAT, &mpitype_vec_nobs)
where P = 20000000 (so quite large).
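
For context, here is a minimal reproducer along the lines of what the user runs. This 
is only a sketch: the MPI_Type_contiguous call and P are from their code, but the 
buffer sizes, counts, and displacements in the Allgatherv are my assumptions about the 
setup, not their actual source.

    /* Minimal reproducer sketch. Only the MPI_Type_contiguous(P, MPI_FLOAT, ...)
     * call and P = 20000000 are taken from the user's code; the counts,
     * displacements, and buffer sizes below are illustrative guesses. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int P = 20000000;                /* 20M floats = 80 MB per block */
        MPI_Datatype mpitype_vec_nobs;
        MPI_Type_contiguous(P, MPI_FLOAT, &mpitype_vec_nobs);
        MPI_Type_commit(&mpitype_vec_nobs);

        /* One block of the derived type contributed per rank (hypothetical). */
        int *recvcounts = malloc(nprocs * sizeof(int));
        int *displs     = malloc(nprocs * sizeof(int));
        for (int i = 0; i < nprocs; i++) {
            recvcounts[i] = 1;    /* one mpitype_vec_nobs from each rank       */
            displs[i]     = i;    /* displacement in units of the type extent  */
        }

        float *sendbuf = malloc((size_t)P * sizeof(float));
        float *recvbuf = malloc((size_t)P * nprocs * sizeof(float));
        for (int i = 0; i < P; i++)
            sendbuf[i] = (float)rank;

        MPI_Allgatherv(sendbuf, 1, mpitype_vec_nobs,
                       recvbuf, recvcounts, displs, mpitype_vec_nobs,
                       MPI_COMM_WORLD);

        free(sendbuf);
        free(recvbuf);
        free(recvcounts);
        free(displs);
        MPI_Type_free(&mpitype_vec_nobs);
        MPI_Finalize();
        return 0;
    }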

Is there any restriction on the maximum size of a datatype? Or perhaps on the extent a 
message can cover, given that the Allgatherv creates its own internal datatypes?
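
In case it helps narrow that down, the size and extent of the user-level type could be 
printed right after the commit in the sketch above using the MPI_Count ("_x") query 
routines (standard MPI-3 calls, not something taken from the failing code):

    /* Fragment continuing the sketch above; add #include <stdio.h>. */
    MPI_Count tsize, tlb, textent;
    MPI_Type_size_x(mpitype_vec_nobs, &tsize);
    MPI_Type_get_extent_x(mpitype_vec_nobs, &tlb, &textent);
    printf("user datatype: size %lld bytes, lb %lld, extent %lld\n",
           (long long)tsize, (long long)tlb, (long long)textent);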

Thanks,
Ben

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
