I've rerun my NetPIPE tests using --mca btl ^uct as Yossi suggested, and that
does indeed get rid of the message failures. I don't see any difference in
performance, but I wanted to check whether there is any downside to doing the
build without uct as suggested.
Dave Turner
>
> Date: Wed, 15 May 2019 10:57:49 +0000
> From: Yossi Itigin <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc: Pavel Shamis <[email protected]>, "[email protected]" <[email protected]>, Sergey Oblomov <[email protected]>
> Subject: Re: [OMPI devel] Seeing message failures in OpenMPI 4.0.1 on UCX
>
> Hi,
>
> (resending this edited email due to 150kB limit on email message size)
>
> The issue below is caused by btl_uct: it mistakenly calls
> ucm_set_external_event() without checking opal_mem_hooks_support_level().
> This leads UCX to believe that memory hooks will be provided by OMPI, but
> in fact they are not, so pinned physical pages become out of sync with the
> process's virtual address space.
>
> * btl_uct wrong call:
> https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/uct/btl_uct_component.c#L132
> * Correct way:
> https://github.com/open-mpi/ompi/blob/master/opal/mca/common/ucx/common_ucx.c#L104
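>
> Roughly, the correct pattern boils down to the sketch below (a sketch only,
> not the verbatim OMPI code; the example_* names are placeholders): advertise
> external memory events to UCX only when OPAL's hooks can actually intercept
> free() and munmap(), and register a release callback that forwards unmapped
> ranges to UCX.
>
> #include <stddef.h>
> #include <stdbool.h>
>
> #include <ucm/api/ucm.h>              /* ucm_set_external_event(), ucm_vm_munmap() */
> #include "opal/memoryhooks/memory.h"  /* opal_mem_hooks_support_level() and friends
>                                          (include paths assume an OMPI build tree) */
>
> /* Called by OPAL whenever memory is released back to the OS; forwards the
>  * range to UCX so it can drop any pinned registrations that cover it. */
> static void example_mem_release_cb(void *buf, size_t length, void *cbdata,
>                                    bool from_alloc)
> {
>     (void)cbdata;
>     (void)from_alloc;
>     ucm_vm_munmap(buf, length);
> }
>
> static void example_setup_mem_hooks(void)
> {
>     int wanted = OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_MUNMAP_SUPPORT;
>
>     /* Only claim that VM_UNMAPPED events will be delivered externally if
>      * OPAL's memory hooks can really intercept free() and munmap();
>      * otherwise UCX must keep its own hooks installed. */
>     if ((opal_mem_hooks_support_level() & wanted) == wanted) {
>         ucm_set_external_event(UCM_EVENT_VM_UNMAPPED);
>         opal_mem_hooks_register_release(example_mem_release_cb, NULL);
>     }
> }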
>
> Since the btl_uct component does not currently have a maintainer, my best
> suggestion is to either disable it in the OMPI configure step (as described in
> https://github.com/open-mpi/ompi/issues/6640#issuecomment-490465625) or to
> exclude it at runtime: "mpirun -mca btl ^uct ..."
>
> UCX reference issue: https://github.com/openucx/ucx/issues/3581
>
> --Yossi
>
> From: Dave Turner <[email protected]>
> Sent: Monday, April 22, 2019 10:26 PM
> To: Yossi Itigin <[email protected]>
> Cc: Pavel Shamis <[email protected]>; [email protected]; Sergey Oblomov <[email protected]>
> Subject: Re: [OMPI devel] Seeing message failures in OpenMPI 4.0.1 on UCX
>
> Yossi,
>
> I reran the base NetPIPE test, then added --mca opal_common_ucx_opal_mem_hooks 1
> as you suggested, but got the same failures and no warning messages. Let me know
> if there is anything else you'd like me to try.
>
> Dave Turner
>
> Elf22 module purge
> Elf22 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile
> hf.elf NPmpi-4.0.1-ucx -o np.elf.mpi-4.0.1-ucx-ib --printhostnames
> Saving output to np.elf.mpi-4.0.1-ucx-ib
>
> Proc 0 is on host elf22
>
> Proc 1 is on host elf23
>
> Clock resolution ~ 1.000 nsecs Clock accuracy ~ 33.000 nsecs
>
> Start testing with 7 trials for each message size
> 1: 1 B 24999 times --> 2.754 Mbps in 2.904 usecs
> 2: 2 B 86077 times --> 5.536 Mbps in 2.890 usecs
> <... cut ...>
> 91: 196.611 KB 1465 times --> 9.222 Gbps in 170.561 usecs
> 92: 262.141 KB 1465 times --> 9.352 Gbps in 224.245 usecs
> 93: 262.144 KB 1114 times --> 9.246 Gbps in 226.826 usecs
> 94: 262.147 KB 1102 times --> 9.243 Gbps in 226.883 usecs
> 95: 393.213 KB 1101 times --> 9.413 Gbps in 334.177 usecs 1 failures
> 96: 393.216 KB 748 times --> 9.418 Gbps in 334.005 usecs 10472 failures
> 97: 393.219 KB 748 times --> 9.413 Gbps in 334.201 usecs 1 failures
> 98: 524.285 KB 748 times --> 9.498 Gbps in 441.601 usecs 1 failures
> <... cut ...>
> 120: 6.291 MB 48 times --> 9.744 Gbps in 5.166 msecs 672 failures
> 121: 6.291 MB 48 times --> 9.736 Gbps in 5.170 msecs 1 failures
> 122: 8.389 MB 48 times --> 9.744 Gbps in 6.887 msecs 1 failures
> 123: 8.389 MB 36 times --> 9.750 Gbps in 6.883 msecs 504 failures
> 124: 8.389 MB 36 times --> 9.739 Gbps in 6.891 msecs 1 failures
>
> Completed with max bandwidth 9.737 Gbps   2.908 usecs latency
>
>
> Work: [email protected] (785) 532-7791
> 2219 Engineering Hall, Manhattan KS 66506
> Home: [email protected]
> cell: (785) 770-5929