Hi, (resending this edited email due to the 150 kB limit on email message size)
The issue below is caused by btl_uct: it mistakenly calls ucm_set_external_event() without checking opal_mem_hooks_support_level(). This leads UCX to believe that memory hooks will be provided by OMPI, but in fact they are not, so pinned physical pages become out of sync with the process virtual address space.

* btl_uct wrong call: https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/uct/btl_uct_component.c#L132
* Correct way: https://github.com/open-mpi/ompi/blob/master/opal/mca/common/ucx/common_ucx.c#L104

(An illustrative sketch of this guarded registration pattern is appended at the end of this message.)

Since the btl_uct component currently does not have a maintainer, my best suggestion is to either disable it when configuring OMPI (as described in https://github.com/open-mpi/ompi/issues/6640#issuecomment-490465625) or at runtime: “mpirun -mca btl ^uct …”

UCX reference issue: https://github.com/openucx/ucx/issues/3581

--Yossi

From: Dave Turner <drdavetur...@gmail.com>
Sent: Monday, April 22, 2019 10:26 PM
To: Yossi Itigin <yos...@mellanox.com>
Cc: Pavel Shamis <pasharesea...@gmail.com>; devel@lists.open-mpi.org; Sergey Oblomov <serg...@mellanox.com>
Subject: Re: [OMPI devel] Seeing message failures in OpenMPI 4.0.1 on UCX

Yossi,

     I reran the base NetPIPE test, then added --mca opal_common_ucx_opal_mem_hooks 1 as you suggested, but got the same failures and no warning messages. Let me know if there is anything else you'd like me to try.

                    Dave Turner

Elf22 module purge
Elf22 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile hf.elf NPmpi-4.0.1-ucx -o np.elf.mpi-4.0.1-ucx-ib --printhostnames
Saving output to np.elf.mpi-4.0.1-ucx-ib
Proc 0 is on host elf22
Proc 1 is on host elf23
Clock resolution ~ 1.000 nsecs
Clock accuracy ~ 33.000 nsecs
Start testing with 7 trials for each message size
1: 1 B 24999 times --> 2.754 Mbps in 2.904 usecs
2: 2 B 86077 times --> 5.536 Mbps in 2.890 usecs
<… cut …>
91: 196.611 KB 1465 times --> 9.222 Gbps in 170.561 usecs
92: 262.141 KB 1465 times --> 9.352 Gbps in 224.245 usecs
93: 262.144 KB 1114 times --> 9.246 Gbps in 226.826 usecs
94: 262.147 KB 1102 times --> 9.243 Gbps in 226.883 usecs
95: 393.213 KB 1101 times --> 9.413 Gbps in 334.177 usecs 1 failures
96: 393.216 KB 748 times --> 9.418 Gbps in 334.005 usecs 10472 failures
97: 393.219 KB 748 times --> 9.413 Gbps in 334.201 usecs 1 failures
98: 524.285 KB 748 times --> 9.498 Gbps in 441.601 usecs 1 failures
<… cut …>
120: 6.291 MB 48 times --> 9.744 Gbps in 5.166 msecs 672 failures
121: 6.291 MB 48 times --> 9.736 Gbps in 5.170 msecs 1 failures
122: 8.389 MB 48 times --> 9.744 Gbps in 6.887 msecs 1 failures
123: 8.389 MB 36 times --> 9.750 Gbps in 6.883 msecs 504 failures
124: 8.389 MB 36 times --> 9.739 Gbps in 6.891 msecs 1 failures
Completed with max bandwidth 9.737 Gbps 2.908 usecs latency

Elf22 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --mca opal_common_ucx_opal_mem_hooks 1 --hostfile hf.elf NPmpi-4.0.1-ucx -o np.elf.mpi-4.0.1-ucx-ib --printhostnames
Saving output to np.elf.mpi-4.0.1-ucx-ib
Proc 0 is on host elf22
Proc 1 is on host elf23
Clock resolution ~ 1.000 nsecs
Clock accuracy ~ 34.000 nsecs
Start testing with 7 trials for each message size
1: 1 B 24999 times --> 2.750 Mbps in 2.909 usecs
2: 2 B 85939 times --> 5.527 Mbps in 2.895 usecs
<… cut …>
87: 131.072 KB 2142 times --> 8.986 Gbps in 116.693 usecs
88: 131.075 KB 2142 times --> 8.987 Gbps in 116.683 usecs
89: 196.605 KB 2142 times --> 9.220 Gbps in 170.584 usecs
90: 196.608 KB 1465 times --> 9.221 Gbps in 170.577 usecs
91: 196.611 KB 1465 times --> 9.222 Gbps in 170.550 usecs
92: 262.141 KB 1465 times --> 9.352 Gbps in 224.250 usecs
93: 262.144 KB 1114 times --> 9.246 Gbps in 226.805 usecs
94: 262.147 KB 1102 times --> 9.244 Gbps in 226.860 usecs
95: 393.213 KB 1102 times --> 9.413 Gbps in 334.172 usecs 1 failures
96: 393.216 KB 748 times --> 9.419 Gbps in 333.994 usecs 10472 failures
97: 393.219 KB 748 times --> 9.413 Gbps in 334.200 usecs 1 failures
98: 524.285 KB 748 times --> 9.499 Gbps in 441.562 usecs 1 failures
99: 524.288 KB 566 times --> 9.504 Gbps in 441.339 usecs 7924 failures
100: 524.291 KB 566 times --> 9.498 Gbps in 441.590 usecs 1 failures
101: 786.429 KB 566 times --> 9.586 Gbps in 656.293 usecs 1 failures
102: 786.432 KB 380 times --> 9.589 Gbps in 656.106 usecs 5320 failures
<… cut …>
116: 4.194 MB 96 times --> 9.731 Gbps in 3.448 msecs 1 failures
117: 4.194 MB 72 times --> 9.733 Gbps in 3.448 msecs 1008 failures
118: 4.194 MB 72 times --> 9.727 Gbps in 3.450 msecs 1 failures
119: 6.291 MB 72 times --> 9.740 Gbps in 5.167 msecs 1 failures
120: 6.291 MB 48 times --> 9.744 Gbps in 5.166 msecs 672 failures
121: 6.291 MB 48 times --> 9.736 Gbps in 5.170 msecs 1 failures
122: 8.389 MB 48 times --> 9.744 Gbps in 6.887 msecs 1 failures
123: 8.389 MB 36 times --> 9.750 Gbps in 6.883 msecs 504 failures
124: 8.389 MB 36 times --> 9.733 Gbps in 6.895 msecs 1 failures
Completed with max bandwidth 9.736 Gbps 2.910 usecs latency

On Mon, Apr 22, 2019 at 3:57 AM Yossi Itigin <yos...@mellanox.com> wrote:

Hi Dave,

It may be related to OPAL memory hooks overwriting UCX’s. Can you please try adding “--mca opal_common_ucx_opal_mem_hooks 1” to mpirun?
(In the latest OpenMPI and UCX versions, we added a warning if such an overwrite happens.)

--Yossi

---------- Forwarded message ---------
From: Dave Turner <drdavetur...@gmail.com>
Date: Tue, Apr 16, 2019 at 2:13 PM
Subject: [OMPI devel] Seeing message failures in OpenMPI 4.0.1 on UCX
To: Open MPI Developers <devel@lists.open-mpi.org>

     After installing UCX 1.5.0 and OpenMPI 4.0.1 compiled for UCX and without verbs (full details below), my NetPIPE benchmark is reporting message failures for some message sizes above 300 KB. There are no failures when I benchmark with a non-UCX (verbs) version of OpenMPI 4.0.1, and no failures when I test the UCX version with --mca btl tcp,self. These failures show up when testing both QDR IB and 40 GbE networks. NetPIPE always tests the first and last bytes, but it can also do a full integrity test using --integrity that checks every byte; this shows that no message is being received at all in the failing cases. Details on the system and software installation are below, followed by several NetPIPE runs illustrating the errors, including a minimal case of 3 ping-pong messages where the middle one shows failures. Let me know if there's any more information you need, or any additional tests I can run.
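To make that distinction concrete, below is a minimal sketch (not NetPIPE source; the 393216-byte size is just one of the failing sizes from the runs below, and the fill pattern and function layout are invented for illustration) of the kind of whole-buffer check that --integrity performs, as opposed to checking only the first and last bytes:

/* integrity_sketch.c - illustrative only, not NetPIPE code.
 * Rank 0 sends a patterned buffer; rank 1 verifies every byte
 * and reports how many bytes did not match. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const size_t nbytes = 393216;          /* one of the failing message sizes */
    unsigned char *buf = malloc(nbytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (size_t i = 0; i < nbytes; i++)       /* fill with a known pattern */
            buf[i] = (unsigned char)(i & 0xFF);
        MPI_Send(buf, (int)nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        size_t failures = 0;
        MPI_Recv(buf, (int)nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        for (size_t i = 0; i < nbytes; i++)       /* verify every byte, not just the ends */
            if (buf[i] != (unsigned char)(i & 0xFF))
                failures++;
        printf("%zu failures\n", failures);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

(Compile with mpicc and run with mpirun -np 2, in the same way as the NetPIPE command lines shown below.)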
                    Dave Turner

CentOS 7 on Intel processors, QDR IB and 40 GbE tests
UCX 1.5.0 installed from the tarball according to the docs on the webpage

OpenMPI-4.0.1 configured for verbs with:
./configure F77=ifort FC=ifort --prefix=/homes/daveturner/libs/openmpi-4.0.1-verbs --enable-mpirun-prefix-by-default --enable-mpi-fortran=all --enable-mpi-cxx --enable-ipv6 --with-verbs --with-slurm --disable-dlopen

OpenMPI-4.0.1 configured for UCX with:
./configure F77=ifort FC=ifort --prefix=/homes/daveturner/libs/openmpi-4.0.1-ucx --enable-mpirun-prefix-by-default --enable-mpi-fortran=all --enable-mpi-cxx --enable-ipv6 --without-verbs --with-slurm --disable-dlopen --with-ucx=/homes/daveturner/libs/ucx-1.5.0/install

NetPIPE compiled with:
/homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpicc -g -O3 -Wall -lrt -DMPI ./src/netpipe.c ./src/mpi.c -o NPmpi-4.0.1-ucx -I./src
(http://netpipe.cs.ksu.edu/ compiled with 'make mpi')

**************************************************************************************
Normal uni-directional point-to-point test shows errors (testing first and last bytes)
for messages over 300 KB.
**************************************************************************************

Elf77 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile hf.elf NPmpi-4.0.1-ucx -o np.elf.mpi-4.0.1-ucx-ib --printhostnames
Saving output to np.elf.mpi-4.0.1-ucx-ib
Proc 0 is on host elf77
Proc 1 is on host elf78
Clock resolution ~ 1.000 nsecs
Clock accuracy ~ 38.000 nsecs
Start testing with 7 trials for each message size
1: 1 B 24999 times --> 3.766 Mbps in 2.124 usecs
2: 2 B 117702 times --> 8.386 Mbps in 1.908 usecs
<… cut …>
91: 196.611 KB 4025 times --> 25.365 Gbps in 62.011 usecs
92: 262.141 KB 4031 times --> 26.066 Gbps in 80.454 usecs
93: 262.144 KB 3107 times --> 27.495 Gbps in 76.275 usecs
94: 262.147 KB 3277 times --> 27.162 Gbps in 77.210 usecs
95: 393.213 KB 3237 times --> 28.291 Gbps in 111.192 usecs 1 failures
96: 393.216 KB 2248 times --> 28.529 Gbps in 110.265 usecs 31472 failures
97: 393.219 KB 2267 times --> 28.360 Gbps in 110.922 usecs 1 failures
98: 524.285 KB 2253 times --> 28.830 Gbps in 145.483 usecs 1 failures
99: 524.288 KB 1718 times --> 28.869 Gbps in 145.288 usecs 24052 failures
100: 524.291 KB 1720 times --> 29.043 Gbps in 144.417 usecs 1 failures
101: 786.429 KB 1731 times --> 29.451 Gbps in 213.626 usecs 1 failures
102: 786.432 KB 1170 times --> 29.383 Gbps in 214.122 usecs 16380 failures
<… cut …>
116: 4.194 MB 302 times --> 30.442 Gbps in 1.102 msecs 1 failures
117: 4.194 MB 226 times --> 30.443 Gbps in 1.102 msecs 3164 failures
118: 4.194 MB 226 times --> 30.342 Gbps in 1.106 msecs 1 failures
119: 6.291 MB 226 times --> 29.276 Gbps in 1.719 msecs
120: 6.291 MB 145 times --> 29.274 Gbps in 1.719 msecs 2030 failures
121: 6.291 MB 145 times --> 29.199 Gbps in 1.724 msecs
122: 8.389 MB 145 times --> 29.012 Gbps in 2.313 msecs 1 failures
123: 8.389 MB 108 times --> 29.046 Gbps in 2.310 msecs 1512 failures
124: 8.389 MB 108 times --> 29.010 Gbps in 2.313 msecs 1 failures
Completed with max bandwidth 30.299 Gbps 1.931 usecs latency

**************************************************************************************
uni-directional point-to-point test with integrity check
does just 1 test for each message size but tests all bytes, not just the first and last bytes.
**************************************************************************************

Elf77 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile hf.elf NPmpi-4.0.1-ucx --printhostnames --integrity
Proc 0 is on host elf77
Doing a message integrity check instead of measuring performance
Proc 1 is on host elf78
Clock resolution ~ 1.000 nsecs
Clock accuracy ~ 39.000 nsecs
Start testing with 1 trials for each message size
1: 1 B 24999 times --> 0 failures
2: 2 B 110029 times --> 0 failures
<… cut …>
92: 262.141 KB 410 times --> 0 failures
93: 262.144 KB 309 times --> 0 failures
94: 262.147 KB 284 times --> 0 failures
95: 393.213 KB 283 times --> 393212 failures
96: 393.216 KB 190 times --> 1180022 failures
97: 393.219 KB 206 times --> 393218 failures
98: 524.285 KB 189 times --> 524284 failures
99: 524.288 KB 143 times --> 1573144 failures
100: 524.291 KB 155 times --> 524290 failures
101: 786.429 KB 143 times --> 786428 failures
102: 786.432 KB 95 times --> 2359480 failures
103: 786.435 KB 103 times --> 786434 failures
104: 1.049 MB 95 times --> 1048572 failures
105: 1.049 MB 72 times --> 3145866 failures
106: 1.049 MB 77 times --> 1048578 failures
107: 1.573 MB 71 times --> 1572860 failures
108: 1.573 MB 48 times --> 4718682 failures
109: 1.573 MB 51 times --> 1572866 failures
110: 2.097 MB 48 times --> 2097148 failures
111: 2.097 MB 36 times --> 6291522 failures
112: 2.097 MB 38 times --> 2097154 failures
113: 3.146 MB 36 times --> 0 failures
114: 3.146 MB 24 times --> 9437226 failures
115: 3.146 MB 25 times --> 0 failures
116: 4.194 MB 24 times --> 4194300 failures
117: 4.194 MB 18 times --> 12582942 failures
118: 4.194 MB 18 times --> 4194306 failures
119: 6.291 MB 18 times --> 6291452 failures
120: 6.291 MB 12 times --> 18874386 failures
121: 6.291 MB 12 times --> 6291458 failures
122: 8.389 MB 12 times --> 8388604 failures
123: 8.389 MB 9 times --> 25165836 failures
124: 8.389 MB 9 times --> 8388610 failures
Completed with max bandwidth 2.596 Gbps 2.013 usecs latency

**************************************************************************************
minimal uni-directional point-to-point test with just 3 messages being passed round trip,
then the same with tcp only, showing no failures when UCX is not used.
**************************************************************************************

Elf77 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile hf.elf NPmpi-4.0.1-ucx --printhostnames --integrity --start 393216 --end 393216 --repeats 1
Proc 0 is on host elf77
Proc 1 is on host elf78
Doing a message integrity check instead of measuring performance
Using a constant number of 1 transmissions
NOTE: Be leary of timings that are close to the clock accuracy.
Clock resolution ~ 1.000 nsecs
Clock accuracy ~ 39.000 nsecs
Start testing with 1 trials for each message size
1: 393.213 KB 1 times --> 0 failures
2: 393.216 KB 1 times --> 786430 failures
3: 393.219 KB 1 times --> 0 failures
Completed with max bandwidth 257.855 Mbps 6.496 msecs latency

Elf77 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --mca btl tcp,self --hostfile hf.elf NPmpi-4.0.1-ucx --printhostnames --integrity --start 393216 --end 393216 --repeats 1
Proc 0 is on host elf77
Doing a message integrity check instead of measuring performance
Using a constant number of 1 transmissions
NOTE: Be leary of timings that are close to the clock accuracy.
Proc 1 is on host elf78
Clock resolution ~ 1.000 nsecs
Clock accuracy ~ 33.000 nsecs
Start testing with 1 trials for each message size
1: 393.213 KB 1 times --> 0 failures
2: 393.216 KB 1 times --> 0 failures
3: 393.219 KB 1 times --> 0 failures
Completed with max bandwidth 232.044 Mbps 7.004 msecs latency

**************************************************************************************
uni-directional point-to-point test with integrity check has no failures when restricted
to message sizes that are multiples of 8 bytes. However, the full test with more messages
of each size still shows some failures.
**************************************************************************************

Elf77 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile hf.elf NPmpi-4.0.1-ucx --printhostnames --integrity --repeats 1 --pert 0
Proc 0 is on host elf77
Doing a message integrity check instead of measuring performance
Using a constant number of 1 transmissions
NOTE: Be leary of timings that are close to the clock accuracy.
Clock resolution ~ 1.000 nsecs
Clock accuracy ~ 34.000 nsecs
Start testing with 1 trials for each message size
1: 1 B 1 times --> 0 failures
Proc 1 is on host elf78
2: 2 B 1 times --> 0 failures
3: 3 B 1 times --> 0 failures
4: 4 B 1 times --> 0 failures
<… cut …>
45: 6.291 MB 1 times --> 0 failures
46: 8.389 MB 1 times --> 0 failures
Completed with max bandwidth 1.108 Gbps 4.775 usecs latency

Elf77 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile hf.elf NPmpi-4.0.1-ucx --printhostnames --pert 0
Proc 0 is on host elf77
Proc 1 is on host elf78
Clock resolution ~ 1.000 nsecs
Clock accuracy ~ 33.000 nsecs
Start testing with 7 trials for each message size
1: 1 B 24999 times --> 3.792 Mbps in 2.110 usecs
2: 2 B 118504 times --> 8.337 Mbps in 1.919 usecs
<… cut …>
35: 196.608 KB 5765 times --> 25.415 Gbps in 61.887 usecs
36: 262.144 KB 4039 times --> 27.430 Gbps in 76.454 usecs
37: 393.216 KB 3269 times --> 28.316 Gbps in 111.095 usecs
38: 524.288 KB 2250 times --> 28.794 Gbps in 145.667 usecs 1 failures
<… cut …>
45: 6.291 MB 225 times --> 29.272 Gbps in 1.719 msecs 1 failures
46: 8.389 MB 145 times --> 29.010 Gbps in 2.313 msecs 1 failures
Completed with max bandwidth 30.112 Gbps 1.953 usecs latency

--
Work: davetur...@ksu.edu (785) 532-7791
2219 Engineering Hall, Manhattan KS 66506
Home: drdavetur...@gmail.com
cell: (785) 770-5929

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

--
Work: davetur...@ksu.edu (785) 532-7791
2219 Engineering Hall, Manhattan KS 66506
Home: drdavetur...@gmail.com
cell: (785) 770-5929
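Appended sketch of the guarded registration pattern referenced above (the "correct way" in opal/mca/common/ucx/common_ucx.c). This is illustrative rather than the exact OMPI code: the function names here are made up for the sketch, and the symbols UCM_EVENT_VM_UNMAPPED, OPAL_MEMORY_FREE_SUPPORT, OPAL_MEMORY_MUNMAP_SUPPORT, opal_mem_hooks_register_release() and ucm_vm_munmap() are taken from my reading of the OPAL memoryhooks and UCM headers. The point is that ucm_set_external_event() is only called once OPAL confirms it can actually deliver free/munmap events, which is the check btl_uct skips.

/* guarded_mem_hooks_sketch.c - illustrative only, not the OMPI source */
#include <stddef.h>
#include <stdbool.h>
#include <ucm/api/ucm.h>              /* ucm_set_external_event(), ucm_vm_munmap() */
#include "opal/memoryhooks/memory.h"  /* opal_mem_hooks_support_level(), hooks API */

/* Callback OPAL invokes when memory is released/unmapped; forward the
 * notification to UCX so its registration cache stays in sync. */
static void mem_release_cb(void *buf, size_t length, void *cbdata, bool from_alloc)
{
    ucm_vm_munmap(buf, length);
}

static void register_ucx_mem_hooks_if_supported(void)
{
    int needed  = OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_MUNMAP_SUPPORT;
    int support = opal_mem_hooks_support_level();

    if ((support & needed) == needed) {
        /* OPAL can really deliver free/munmap events, so it is safe to tell
         * UCX that unmap events will come from an external source. */
        ucm_set_external_event(UCM_EVENT_VM_UNMAPPED);
        opal_mem_hooks_register_release(mem_release_cb, NULL);
    }
    /* Otherwise leave UCX's own memory hooks enabled.  Claiming external
     * events without providing them (what btl_uct does) leaves UCX's
     * pinned-page registrations out of sync with the virtual address space. */
}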
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel