Hi,

(resending this edited email due to 150kB limit on email message size)

The issue below is caused by btl_uct – it mistakenly calls
ucm_set_external_event() without first checking opal_mem_hooks_support_level().
This leads UCX to believe that memory hooks will be provided by OMPI when in fact
they are not, so the pinned physical pages fall out of sync with the process's
virtual address space (see the sketch below the links).

  *   btl_uct wrong call: 
https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/uct/btl_uct_component.c#L132
  *   Correct way: 
https://github.com/open-mpi/ompi/blob/master/opal/mca/common/ucx/common_ucx.c#L104
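
For reference, here is a minimal sketch of the guarded pattern from the second
link, based on my reading of that file (the callback name and the helper wrapper
are illustrative; opal_mem_hooks_support_level(), ucm_set_external_event(),
ucm_vm_munmap() and opal_mem_hooks_register_release() are the real APIs):

    #include "opal/memoryhooks/memory.h"   /* assumes building inside the OMPI tree */
    #include <ucm/api/ucm.h>

    /* Forward OPAL memory-release events to UCX.  The real callback lives in
     * common_ucx.c; the name here is just for the sketch. */
    static void mem_release_cb(void *buf, size_t length, void *cbdata,
                               bool from_alloc)
    {
        (void)cbdata;
        (void)from_alloc;
        ucm_vm_munmap(buf, length);
    }

    static void register_mem_hooks_if_supported(void)
    {
        int needed  = OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_MUNMAP_SUPPORT;
        int support = opal_mem_hooks_support_level();

        if ((support & needed) == needed) {
            /* OPAL really does intercept free()/munmap(), so UCX may rely on
             * externally provided VM_UNMAPPED events ... */
            ucm_set_external_event(UCM_EVENT_VM_UNMAPPED);
            /* ... delivered through this release hook. */
            opal_mem_hooks_register_release(mem_release_cb, NULL);
        }
        /* Otherwise do not call ucm_set_external_event() at all, so UCX
         * installs its own hooks and pinned pages stay in sync. */
    }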

Since the btl_uct component does not currently have a maintainer, my best
suggestion is to either disable it in OMPI configure (as described in
https://github.com/open-mpi/ompi/issues/6640#issuecomment-490465625) or at
runtime: “mpirun -mca btl ^uct …”
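
Concretely, something along these lines (the configure option is the one I
recall from that issue comment, so please double-check it there; the runtime
form is exact):

    # configure time: skip building the uct BTL component
    ./configure --enable-mca-no-build=btl-uct ...

    # or at runtime, exclude it for a single job
    mpirun -mca btl ^uct ...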

UCX reference issue: https://github.com/openucx/ucx/issues/3581

--Yossi

From: Dave Turner <drdavetur...@gmail.com>
Sent: Monday, April 22, 2019 10:26 PM
To: Yossi Itigin <yos...@mellanox.com>
Cc: Pavel Shamis <pasharesea...@gmail.com>; devel@lists.open-mpi.org; Sergey Oblomov <serg...@mellanox.com>
Subject: Re: [OMPI devel] Seeing message failures in OpenMPI 4.0.1 on UCX

Yossi,

    I reran the base NetPIPE test then added --mca 
opal_common_ucx_opal_mem_hooks 1 as
you suggested but got the same failures and no warning messages.  Let me know if
there is anything else you'd like me to try.

                          Dave Turner

Elf22 module purge
Elf22 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile 
hf.elf NPmpi-4.0.1-ucx -o np.elf.mpi-4.0.1-ucx-ib --printhostnames
Saving output to np.elf.mpi-4.0.1-ucx-ib

Proc 0 is on host elf22

Proc 1 is on host elf23

      Clock resolution ~   1.000 nsecs      Clock accuracy ~  33.000 nsecs

Start testing with 7 trials for each message size
  1:       1  B     24999 times -->    2.754 Mbps  in    2.904 usecs
  2:       2  B     86077 times -->    5.536 Mbps  in    2.890 usecs
  <… cut …>
 91: 196.611 KB      1465 times -->    9.222 Gbps  in  170.561 usecs
 92: 262.141 KB      1465 times -->    9.352 Gbps  in  224.245 usecs
 93: 262.144 KB      1114 times -->    9.246 Gbps  in  226.826 usecs
 94: 262.147 KB      1102 times -->    9.243 Gbps  in  226.883 usecs
 95: 393.213 KB      1101 times -->    9.413 Gbps  in  334.177 usecs   1 failures
 96: 393.216 KB       748 times -->    9.418 Gbps  in  334.005 usecs   10472 failures
 97: 393.219 KB       748 times -->    9.413 Gbps  in  334.201 usecs   1 failures
 98: 524.285 KB       748 times -->    9.498 Gbps  in  441.601 usecs   1 failures
   <… cut …>
120:   6.291 MB        48 times -->    9.744 Gbps  in    5.166 msecs   672 failures
121:   6.291 MB        48 times -->    9.736 Gbps  in    5.170 msecs   1 failures
122:   8.389 MB        48 times -->    9.744 Gbps  in    6.887 msecs   1 failures
123:   8.389 MB        36 times -->    9.750 Gbps  in    6.883 msecs   504 failures
124:   8.389 MB        36 times -->    9.739 Gbps  in    6.891 msecs   1 failures

Completed with        max bandwidth    9.737 Gbps        2.908 usecs latency


Elf22 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --mca 
opal_common_ucx_opal_mem_hooks 1 --hostfile hf.elf NPmpi-4.0.1-ucx -o 
np.elf.mpi-4.0.1-ucx-ib --printhostnames
Saving output to np.elf.mpi-4.0.1-ucx-ib

Proc 0 is on host elf22

Proc 1 is on host elf23

      Clock resolution ~   1.000 nsecs      Clock accuracy ~  34.000 nsecs

Start testing with 7 trials for each message size
  1:       1  B     24999 times -->    2.750 Mbps  in    2.909 usecs
  2:       2  B     85939 times -->    5.527 Mbps  in    2.895 usecs
  <… cut …>
 87: 131.072 KB      2142 times -->    8.986 Gbps  in  116.693 usecs
 88: 131.075 KB      2142 times -->    8.987 Gbps  in  116.683 usecs
 89: 196.605 KB      2142 times -->    9.220 Gbps  in  170.584 usecs
 90: 196.608 KB      1465 times -->    9.221 Gbps  in  170.577 usecs
 91: 196.611 KB      1465 times -->    9.222 Gbps  in  170.550 usecs
 92: 262.141 KB      1465 times -->    9.352 Gbps  in  224.250 usecs
 93: 262.144 KB      1114 times -->    9.246 Gbps  in  226.805 usecs
 94: 262.147 KB      1102 times -->    9.244 Gbps  in  226.860 usecs
 95: 393.213 KB      1102 times -->    9.413 Gbps  in  334.172 usecs   1 failures
 96: 393.216 KB       748 times -->    9.419 Gbps  in  333.994 usecs   10472 failures
 97: 393.219 KB       748 times -->    9.413 Gbps  in  334.200 usecs   1 failures
 98: 524.285 KB       748 times -->    9.499 Gbps  in  441.562 usecs   1 failures
 99: 524.288 KB       566 times -->    9.504 Gbps  in  441.339 usecs   7924 failures
100: 524.291 KB       566 times -->    9.498 Gbps  in  441.590 usecs   1 failures
101: 786.429 KB       566 times -->    9.586 Gbps  in  656.293 usecs   1 failures
102: 786.432 KB       380 times -->    9.589 Gbps  in  656.106 usecs   5320 failures
  <… cut …>
116:   4.194 MB        96 times -->    9.731 Gbps  in    3.448 msecs   1 failures
117:   4.194 MB        72 times -->    9.733 Gbps  in    3.448 msecs   1008 failures
118:   4.194 MB        72 times -->    9.727 Gbps  in    3.450 msecs   1 failures
119:   6.291 MB        72 times -->    9.740 Gbps  in    5.167 msecs   1 failures
120:   6.291 MB        48 times -->    9.744 Gbps  in    5.166 msecs   672 failures
121:   6.291 MB        48 times -->    9.736 Gbps  in    5.170 msecs   1 failures
122:   8.389 MB        48 times -->    9.744 Gbps  in    6.887 msecs   1 failures
123:   8.389 MB        36 times -->    9.750 Gbps  in    6.883 msecs   504 failures
124:   8.389 MB        36 times -->    9.733 Gbps  in    6.895 msecs   1 failures

Completed with        max bandwidth    9.736 Gbps        2.910 usecs latency



On Mon, Apr 22, 2019 at 3:57 AM Yossi Itigin <yos...@mellanox.com> wrote:
Hi Dave,

It may be related to the OPAL memory hooks overwriting UCX’s.
Can you please try adding “--mca opal_common_ucx_opal_mem_hooks 1” to mpirun?
(In the latest OpenMPI and UCX versions, we added a warning if such an overwrite
happens.)

--Yossi

---------- Forwarded message ---------
From: Dave Turner <drdavetur...@gmail.com>
Date: Tue, Apr 16, 2019 at 2:13 PM
Subject: [OMPI devel] Seeing message failures in OpenMPI 4.0.1 on UCX
To: Open MPI Developers <devel@lists.open-mpi.org>


    After installing UCX 1.5.0 and OpenMPI 4.0.1 compiled for UCX and without verbs
(full details below), my NetPIPE benchmark reports message failures for some
message sizes above 300 KB.  There are no failures when I benchmark with a
non-UCX (verbs) build of OpenMPI 4.0.1, and none when I test the UCX build with
--mca btl tcp,self.  The failures show up when testing both QDR IB and 40 GbE
networks.  NetPIPE always tests the first and last bytes of each message, but it
can also do a full integrity test with --integrity that checks every byte; that
test shows the message is not being received at all in the failing cases.
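
As a rough illustration of what the --integrity check amounts to (hypothetical
code, not NetPIPE's actual source), the receiver verifies every byte of the
buffer against the expected test pattern and counts the mismatches:

    #include <stddef.h>

    /* Sketch only: count received bytes that do not match the expected
     * test pattern.  NetPIPE's real check lives in its own source. */
    static size_t count_failures(const unsigned char *buf, size_t len)
    {
        size_t failures = 0;
        for (size_t i = 0; i < len; i++) {
            if (buf[i] != (unsigned char)(i & 0xff)) {
                failures++;
            }
        }
        return failures;
    }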

    Details on the system and software installation are below followed by
several NetPIPE runs illustrating the errors.  This includes a minimal
case of 3 ping-pong messages where the middle one shows failures.  Let me
know if there's any more information you need, or any additional tests I
can run.

                 Dave Turner



CentOS 7 on Intel processors, QDR IB and 40 GbE tests

UCX 1.5.0 installed from the tarball according to the docs on the webpage

OpenMPI-4.0.1 configured for verbs with:

./configure F77=ifort FC=ifort 
--prefix=/homes/daveturner/libs/openmpi-4.0.1-verbs 
--enable-mpirun-prefix-by-default --enable-mpi-fortran=all --enable-mpi-cxx 
--enable-ipv6 --with-verbs --with-slurm --disable-dlopen

OpenMPI-4.0.1 configured for UCX  with:

./configure F77=ifort FC=ifort 
--prefix=/homes/daveturner/libs/openmpi-4.0.1-ucx 
--enable-mpirun-prefix-by-default --enable-mpi-fortran=all --enable-mpi-cxx 
--enable-ipv6 --without-verbs --with-slurm --disable-dlopen 
--with-ucx=/homes/daveturner/libs/ucx-1.5.0/install

NetPIPE compiled with:

/homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpicc -g -O3 -Wall -lrt -DMPI  
./src/netpipe.c ./src/mpi.c -o NPmpi-4.0.1-ucx -I./src

(http://netpipe.cs.ksu.edu/  compiled with 'make mpi')



**************************************************************************************
Normal uni-directional point-to-point test shows errors (testing first and last bytes)
for messages over 300 KB.
**************************************************************************************

Elf77 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile 
hf.elf NPmpi-4.0.1-ucx -o np.elf.mpi-4.0.1-ucx-ib --printhostnames
Saving output to np.elf.mpi-4.0.1-ucx-ib

Proc 0 is on host elf77

Proc 1 is on host elf78

      Clock resolution ~   1.000 nsecs      Clock accuracy ~  38.000 nsecs

Start testing with 7 trials for each message size
  1:       1  B     24999 times -->    3.766 Mbps  in    2.124 usecs
  2:       2  B    117702 times -->    8.386 Mbps  in    1.908 usecs
  <… cut …>
 91: 196.611 KB      4025 times -->   25.365 Gbps  in   62.011 usecs
 92: 262.141 KB      4031 times -->   26.066 Gbps  in   80.454 usecs
 93: 262.144 KB      3107 times -->   27.495 Gbps  in   76.275 usecs
 94: 262.147 KB      3277 times -->   27.162 Gbps  in   77.210 usecs
 95: 393.213 KB      3237 times -->   28.291 Gbps  in  111.192 usecs   1 failures
 96: 393.216 KB      2248 times -->   28.529 Gbps  in  110.265 usecs   31472 failures
 97: 393.219 KB      2267 times -->   28.360 Gbps  in  110.922 usecs   1 failures
 98: 524.285 KB      2253 times -->   28.830 Gbps  in  145.483 usecs   1 failures
 99: 524.288 KB      1718 times -->   28.869 Gbps  in  145.288 usecs   24052 failures
100: 524.291 KB      1720 times -->   29.043 Gbps  in  144.417 usecs   1 failures
101: 786.429 KB      1731 times -->   29.451 Gbps  in  213.626 usecs   1 failures
102: 786.432 KB      1170 times -->   29.383 Gbps  in  214.122 usecs   16380 failures
  <… cut …>
116:   4.194 MB       302 times -->   30.442 Gbps  in    1.102 msecs   1 failures
117:   4.194 MB       226 times -->   30.443 Gbps  in    1.102 msecs   3164 failures
118:   4.194 MB       226 times -->   30.342 Gbps  in    1.106 msecs   1 failures
119:   6.291 MB       226 times -->   29.276 Gbps  in    1.719 msecs
120:   6.291 MB       145 times -->   29.274 Gbps  in    1.719 msecs   2030 failures
121:   6.291 MB       145 times -->   29.199 Gbps  in    1.724 msecs
122:   8.389 MB       145 times -->   29.012 Gbps  in    2.313 msecs   1 failures
123:   8.389 MB       108 times -->   29.046 Gbps  in    2.310 msecs   1512 failures
124:   8.389 MB       108 times -->   29.010 Gbps  in    2.313 msecs   1 failures

Completed with        max bandwidth   30.299 Gbps        1.931 usecs latency



**************************************************************************************
uni-directional point-to-point test with integrity check does just 1 test for each
message size but tests all bytes, not just the first and last bytes.
**************************************************************************************

Elf77 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile 
hf.elf NPmpi-4.0.1-ucx  --printhostnames --integrity
Proc 0 is on host elf77

Doing a message integrity check instead of measuring performance
Proc 1 is on host elf78

      Clock resolution ~   1.000 nsecs      Clock accuracy ~  39.000 nsecs

Start testing with 1 trials for each message size
  1:       1  B     24999 times -->          0 failures
  2:       2  B    110029 times -->          0 failures
  <… cut …>
 92: 262.141 KB       410 times -->          0 failures
 93: 262.144 KB       309 times -->          0 failures
 94: 262.147 KB       284 times -->          0 failures
 95: 393.213 KB       283 times -->     393212 failures
 96: 393.216 KB       190 times -->    1180022 failures
 97: 393.219 KB       206 times -->     393218 failures
 98: 524.285 KB       189 times -->     524284 failures
 99: 524.288 KB       143 times -->    1573144 failures
100: 524.291 KB       155 times -->     524290 failures
101: 786.429 KB       143 times -->     786428 failures
102: 786.432 KB        95 times -->    2359480 failures
103: 786.435 KB       103 times -->     786434 failures
104:   1.049 MB        95 times -->    1048572 failures
105:   1.049 MB        72 times -->    3145866 failures
106:   1.049 MB        77 times -->    1048578 failures
107:   1.573 MB        71 times -->    1572860 failures
108:   1.573 MB        48 times -->    4718682 failures
109:   1.573 MB        51 times -->    1572866 failures
110:   2.097 MB        48 times -->    2097148 failures
111:   2.097 MB        36 times -->    6291522 failures
112:   2.097 MB        38 times -->    2097154 failures
113:   3.146 MB        36 times -->          0 failures
114:   3.146 MB        24 times -->    9437226 failures
115:   3.146 MB        25 times -->          0 failures
116:   4.194 MB        24 times -->    4194300 failures
117:   4.194 MB        18 times -->   12582942 failures
118:   4.194 MB        18 times -->    4194306 failures
119:   6.291 MB        18 times -->    6291452 failures
120:   6.291 MB        12 times -->   18874386 failures
121:   6.291 MB        12 times -->    6291458 failures
122:   8.389 MB        12 times -->    8388604 failures
123:   8.389 MB         9 times -->   25165836 failures
124:   8.389 MB         9 times -->    8388610 failures

Completed with        max bandwidth    2.596 Gbps        2.013 usecs latency


**************************************************************************************
minimal uni-directional point-to-point with just 3 messages being passed
round trip, then the same with tcp only showing no failures when UCX is not 
used.
**************************************************************************************


Elf77 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile 
hf.elf NPmpi-4.0.1-ucx  --printhostnames --integrity --start 393216 --end 
393216 --repeats 1
Proc 0 is on host elf77
Proc 1 is on host elf78

Doing a message integrity check instead of measuring performance
Using a constant number of 1 transmissions
NOTE: Be leary of timings that are close to the clock accuracy.

      Clock resolution ~   1.000 nsecs      Clock accuracy ~  39.000 nsecs

Start testing with 1 trials for each message size
  1: 393.213 KB         1 times -->          0 failures
  2: 393.216 KB         1 times -->     786430 failures
  3: 393.219 KB         1 times -->          0 failures

Completed with        max bandwidth  257.855 Mbps        6.496 msecs latency


Elf77 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --mca btl 
tcp,self --hostfile hf.elf NPmpi-4.0.1-ucx  --printhostnames --integrity 
--start 393216 --end 393216 --repeats 1
Proc 0 is on host elf77

Doing a message integrity check instead of measuring performance
Using a constant number of 1 transmissions
NOTE: Be leary of timings that are close to the clock accuracy.
Proc 1 is on host elf78

      Clock resolution ~   1.000 nsecs      Clock accuracy ~  33.000 nsecs

Start testing with 1 trials for each message size
  1: 393.213 KB         1 times -->          0 failures
  2: 393.216 KB         1 times -->          0 failures
  3: 393.219 KB         1 times -->          0 failures

Completed with        max bandwidth  232.044 Mbps        7.004 msecs latency
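
For what it's worth, the minimal case above should be reproducible outside of
NetPIPE with a small standalone ping-pong like the sketch below (hypothetical
code, not NetPIPE's source; it sends one 393216-byte pattern-filled buffer from
rank 0 to rank 1 and back, checks it on return, and would be built with the same
mpicc shown below):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        const int n = 393216;                      /* one of the failing sizes */
        int rank;
        unsigned char *buf = malloc(n);
        unsigned char *ref = malloc(n);

        for (int i = 0; i < n; i++)                /* known test pattern */
            ref[i] = (unsigned char)(i & 0xff);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            memcpy(buf, ref, n);
            MPI_Send(buf, n, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            memset(buf, 0, n);                     /* make missing data obvious */
            MPI_Recv(buf, n, MPI_BYTE, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%s\n", memcmp(buf, ref, n) ? "data corrupted" : "data OK");
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_BYTE, 0, 1, MPI_COMM_WORLD);
        }

        free(buf);
        free(ref);
        MPI_Finalize();
        return 0;
    }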



**************************************************************************************
uni-directional point-to-point test with integrity check has no failures when
restricted to only factors of 8 bytes.  However, the full test with more messages
of each size still shows some failures.
**************************************************************************************


Elf77 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile 
hf.elf NPmpi-4.0.1-ucx  --printhostnames --integrity --repeats 1 --pert 0
Proc 0 is on host elf77

Doing a message integrity check instead of measuring performance
Using a constant number of 1 transmissions
NOTE: Be leary of timings that are close to the clock accuracy.

      Clock resolution ~   1.000 nsecs      Clock accuracy ~  34.000 nsecs

Start testing with 1 trials for each message size
  1:       1  B         1 times -->          0 failures
Proc 1 is on host elf78
  2:       2  B         1 times -->          0 failures
  3:       3  B         1 times -->          0 failures
  4:       4  B         1 times -->          0 failures
  <… cut …>
 45:   6.291 MB         1 times -->          0 failures
 46:   8.389 MB         1 times -->          0 failures

Completed with        max bandwidth    1.108 Gbps        4.775 usecs latency


Elf77 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile 
hf.elf NPmpi-4.0.1-ucx  --printhostnames --pert 0
Proc 0 is on host elf77

Proc 1 is on host elf78

      Clock resolution ~   1.000 nsecs      Clock accuracy ~  33.000 nsecs

Start testing with 7 trials for each message size
  1:       1  B     24999 times -->    3.792 Mbps  in    2.110 usecs
  2:       2  B    118504 times -->    8.337 Mbps  in    1.919 usecs
  <… cut …>
 35: 196.608 KB      5765 times -->   25.415 Gbps  in   61.887 usecs
 36: 262.144 KB      4039 times -->   27.430 Gbps  in   76.454 usecs
 37: 393.216 KB      3269 times -->   28.316 Gbps  in  111.095 usecs
 38: 524.288 KB      2250 times -->   28.794 Gbps  in  145.667 usecs   1 failures
  <… cut …>
 45:   6.291 MB       225 times -->   29.272 Gbps  in    1.719 msecs   1 failures
 46:   8.389 MB       145 times -->   29.010 Gbps  in    2.313 msecs   1 failures

Completed with        max bandwidth   30.112 Gbps        1.953 usecs latency


--
Work:     davetur...@ksu.edu     (785) 532-7791
             2219 Engineering Hall, Manhattan KS  66506
Home:    drdavetur...@gmail.com
              cell: (785) 770-5929
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

