I've rerun my NetPIPE tests using --mca btl ^uct as Yossi suggested, and that does indeed get rid of the message failures. I don't see any difference in performance, but I wanted to check whether there is any downside to doing the build without uct as suggested.
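In case it's useful to anyone else hitting this, both workarounds look roughly like the following. The configure flag is the standard --enable-mca-no-build mechanism that the GitHub issue below describes; the paths are placeholders rather than my exact build lines:

    # Run-time workaround: exclude the uct BTL for a single job
    mpirun --mca btl ^uct -np 2 --hostfile hf.elf NPmpi-4.0.1-ucx

    # Build-time workaround: skip building the uct BTL entirely
    ./configure --prefix=/path/to/install --with-ucx=/path/to/ucx \
                --enable-mca-no-build=btl-uct
    make all install

As I understand it, either route only removes the uct BTL component; the UCX PML that carries normal point-to-point traffic is a separate component and is unaffected.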
Dave Turner

> Date: Wed, 15 May 2019 10:57:49 +0000
> From: Yossi Itigin <yos...@mellanox.com>
> To: "drdavetur...@gmail.com" <drdavetur...@gmail.com>
> Cc: Pavel Shamis <pasharesea...@gmail.com>, "devel@lists.open-mpi.org"
>     <devel@lists.open-mpi.org>, Sergey Oblomov <serg...@mellanox.com>
> Subject: Re: [OMPI devel] Seeing message failures in OpenMPI 4.0.1 on UCX
>
> Hi,
>
> (resending this edited email due to the 150 kB limit on email message size)
>
> The issue below is caused by btl_uct: it mistakenly calls
> ucm_set_external_event() without checking opal_mem_hooks_support_level().
> This leads UCX to believe that memory hooks will be provided by OMPI, but
> in fact they are not, so pinned physical pages fall out of sync with the
> process virtual address space.
>
> * btl_uct wrong call:
>   https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/uct/btl_uct_component.c#L132
> * Correct way:
>   https://github.com/open-mpi/ompi/blob/master/opal/mca/common/ucx/common_ucx.c#L104
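To make the fix concrete, here is a rough C sketch of the guarded pattern from the "Correct way" link above. The names come from those two files and the UCX UCM API, but this is a paraphrase that assumes the surrounding OMPI source tree, not a verbatim copy:

    /* Sketch (paraphrased from opal/mca/common/ucx/common_ucx.c):
     * only tell UCX that OMPI will deliver memory events if the
     * memory-hooks framework can actually intercept free()/munmap(). */
    #include <stdbool.h>
    #include <stddef.h>
    #include <ucm/api/ucm.h>
    #include "opal/memoryhooks/memory.h"

    static void mem_release_cb(void *buf, size_t length,
                               void *cbdata, bool from_alloc)
    {
        /* Forward each release to UCX so pinned pages stay in sync
         * with the process virtual address space. */
        ucm_vm_munmap(buf, length);
    }

    static void register_mem_hooks(void)
    {
        int needed = OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_MUNMAP_SUPPORT;

        if ((opal_mem_hooks_support_level() & needed) == needed) {
            /* Safe: OPAL really can deliver the events UCX is being
             * told to expect from the outside. */
            ucm_set_external_event(UCM_EVENT_VMA_UNMAPPED);
            opal_mem_hooks_register_release(mem_release_cb, NULL);
        }
        /* btl_uct's bug is calling ucm_set_external_event() without
         * this check, so UCX waits for unmap events that never arrive. */
    }

That failure mode matches the numbers below: transfers complete at full bandwidth, but stale registrations occasionally deliver wrong data, which presumably is what NetPIPE counts as failures.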
> Since the btl_uct component currently does not have a maintainer, my best
> suggestion is to either disable it in the OMPI configure (as described in
> https://github.com/open-mpi/ompi/issues/6640#issuecomment-490465625) or
> at runtime: "mpirun -mca btl ^uct ..."
>
> UCX reference issue: https://github.com/openucx/ucx/issues/3581
>
> --Yossi
>
> From: Dave Turner <drdavetur...@gmail.com>
> Sent: Monday, April 22, 2019 10:26 PM
> To: Yossi Itigin <yos...@mellanox.com>
> Cc: Pavel Shamis <pasharesea...@gmail.com>; devel@lists.open-mpi.org;
>     Sergey Oblomov <serg...@mellanox.com>
> Subject: Re: [OMPI devel] Seeing message failures in OpenMPI 4.0.1 on UCX
>
> Yossi,
>
> I reran the base NetPIPE test, then added --mca
> opal_common_ucx_opal_mem_hooks 1 as you suggested, but got the same
> failures and no warning messages. Let me know if there is anything else
> you'd like me to try.
>
> Dave Turner
>
> Elf22> module purge
> Elf22> /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 \
>        --hostfile hf.elf NPmpi-4.0.1-ucx -o np.elf.mpi-4.0.1-ucx-ib \
>        --printhostnames
> Saving output to np.elf.mpi-4.0.1-ucx-ib
>
> Proc 0 is on host elf22
> Proc 1 is on host elf23
>
> Clock resolution ~ 1.000 nsecs   Clock accuracy ~ 33.000 nsecs
>
> Start testing with 7 trials for each message size
>   1:       1 B   24999 times -->  2.754 Mbps in   2.904 usecs
>   2:       2 B   86077 times -->  5.536 Mbps in   2.890 usecs
> <... cut ...>
>  91: 196.611 KB   1465 times -->  9.222 Gbps in 170.561 usecs
>  92: 262.141 KB   1465 times -->  9.352 Gbps in 224.245 usecs
>  93: 262.144 KB   1114 times -->  9.246 Gbps in 226.826 usecs
>  94: 262.147 KB   1102 times -->  9.243 Gbps in 226.883 usecs
>  95: 393.213 KB   1101 times -->  9.413 Gbps in 334.177 usecs      1 failures
>  96: 393.216 KB    748 times -->  9.418 Gbps in 334.005 usecs  10472 failures
>  97: 393.219 KB    748 times -->  9.413 Gbps in 334.201 usecs      1 failures
>  98: 524.285 KB    748 times -->  9.498 Gbps in 441.601 usecs      1 failures
> <... cut ...>
> 120:   6.291 MB     48 times -->  9.744 Gbps in   5.166 msecs    672 failures
> 121:   6.291 MB     48 times -->  9.736 Gbps in   5.170 msecs      1 failures
> 122:   8.389 MB     48 times -->  9.744 Gbps in   6.887 msecs      1 failures
> 123:   8.389 MB     36 times -->  9.750 Gbps in   6.883 msecs    504 failures
> 124:   8.389 MB     36 times -->  9.739 Gbps in   6.891 msecs      1 failures
>
> Completed with max bandwidth 9.737 Gbps   2.908 usecs latency

Work: davetur...@ksu.edu   (785) 532-7791   2219 Engineering Hall, Manhattan KS 66506
Home: drdavetur...@gmail.com   cell: (785) 770-5929