These values read like they should be unsigned, so the negative numbers look suspicious to me:

nbElems -909934592
count -1819869184

Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov




> On Nov 1, 2018, at 10:34 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
> 
> Hi,
> 
> I haven’t heard back from the user yet, but I just put this example together 
> which works on 1, 2, and 3 ranks but fails for 4. Unfortunately it needs a 
> fair amount of memory, about 14.3GB per process, so I was running it with 
> -map-by ppr:1:node.
> 
> It doesn’t fail with the segfault as the user’s code does, but it does 
> SIGABRT:
> 
> 16:12 bjm900@r4320 MPI_TESTS > mpirun -mca pml ob1 -mca coll ^fca,hcoll 
> -map-by ppr:1:node -np 4 ./a.out
> [r4450:11544] ../../../../../opal/datatype/opal_datatype_pack.h:53
>       Pointer 0x2bb7ceedb010 size 131040 is outside 
> [0x2b9ec63cb010,0x2bad1458b010] for
>       base ptr 0x2b9ec63cb010 count 1 and data 
> [r4450:11544] Datatype 0x145fe90[] size 30720000000 align 4 id 0 length 7 
> used 6
> true_lb 0 true_ub 61440000000 (true_extent 61440000000) lb 0 ub 61440000000 
> (extent 61440000000)
> nbElems -909934592 loops 4 flags 104 (committed )-c-----GD--[---][---]
>    contain OPAL_FLOAT4:* 
> --C--------[---][---]    OPAL_LOOP_S 192 times the next 2 elements extent 
> 80000000
> --C---P-D--[---][---]    OPAL_FLOAT4 count 20000000 disp 0xaba950000 
> (46080000000) blen 0 extent 4 (size 80000000)
> --C--------[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 
> 46080000000 size of data 80000000
> --C--------[---][---]    OPAL_LOOP_S 192 times the next 2 elements extent 
> 80000000
> --C---P-D--[---][---]    OPAL_FLOAT4 count 20000000 disp 0x0 (0) blen 0 
> extent 4 (size 80000000)
> --C--------[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 
> 0 size of data 80000000
> -------G---[---][---]    OPAL_LOOP_E prev 6 elements first elem displacement 
> 46080000000 size of data 655228928
> Optimized description 
> -cC---P-DB-[---][---]     OPAL_UINT1 count -1819869184 disp 0xaba950000 
> (46080000000) blen 1 extent 1 (size 15360000000)
> -cC---P-DB-[---][---]     OPAL_UINT1 count -1819869184 disp 0x0 (0) blen 1 
> extent 1 (size 15360000000)
> -------G---[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 
> 46080000000 
> [r4450:11544] *** Process received signal ***
> [r4450:11544] Signal: Aborted (6)
> [r4450:11544] Signal code:  (-6)
> 
> Cheers,
> Ben
> 
> <allgatherv_failure.c><ATT00001.html>
> _______________________________________________
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
