I've rerun my NetPIPE tests using --mca btl ^uct as Yossi suggested, and
that does indeed get rid of the message failures.  I don't see any
difference in performance, but I wanted to check whether there is any
downside to doing the build without uct as suggested.

                     Dave Turner


> Date: Wed, 15 May 2019 10:57:49 +0000
> From: Yossi Itigin <yos...@mellanox.com>
> To: "drdavetur...@gmail.com" <drdavetur...@gmail.com>
> Cc: Pavel Shamis <pasharesea...@gmail.com>, "devel@lists.open-mpi.org"
>         <devel@lists.open-mpi.org>, Sergey Oblomov <serg...@mellanox.com>
> Subject: Re: [OMPI devel] Seeing message failures in OpenMPI 4.0.1 on
>         UCX
>
> Hi,
>
> (resending this edited email due to 150kB limit on email message size)
>
> The issue below is caused by btl_uct: it mistakenly calls
> ucm_set_external_event() without checking opal_mem_hooks_support_level().
> This leads UCX to believe that memory hooks will be provided by OMPI, but
> in fact they are not, so pinned physical pages become out of sync with the
> process virtual address space.
>
>   *   btl_uct wrong call: https://github.com/open-mpi/ompi/blob/master/opal/mca/btl/uct/btl_uct_component.c#L132
>   *   Correct way: https://github.com/open-mpi/ompi/blob/master/opal/mca/common/ucx/common_ucx.c#L104
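>
> For reference, the guarded registration in common_ucx.c looks roughly like
> the sketch below (a paraphrase of the linked file, not verbatim source; the
> names mem_release_cb and register_mem_hooks are placeholders used only for
> illustration):
>
>     #include <stddef.h>
>     #include <stdbool.h>
>     #include "opal/memoryhooks/memory.h"  /* opal_mem_hooks_support_level() */
>     #include <ucm/api/ucm.h>              /* ucm_set_external_event()       */
>
>     /* Forward the release event to UCX so it can unpin the pages. */
>     static void mem_release_cb(void *buf, size_t length, void *cbdata,
>                                bool from_alloc)
>     {
>         ucm_vm_munmap(buf, length);
>     }
>
>     static void register_mem_hooks(void)
>     {
>         int level = opal_mem_hooks_support_level();
>
>         /* Only tell UCX that OMPI will deliver memory events if OPAL's
>          * hooks really intercept free() and munmap(); otherwise leave
>          * UCX to install its own hooks. */
>         if ((level & (OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_MUNMAP_SUPPORT)) ==
>             (OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_MUNMAP_SUPPORT)) {
>             ucm_set_external_event(UCM_EVENT_VM_UNMAPPED);
>             opal_mem_hooks_register_release(mem_release_cb, NULL);
>         }
>     }
>
> btl_uct makes the ucm_set_external_event() call without this check, which
> is what causes the mismatch described above.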
>
> Since the btl_uct component does not currently have a maintainer, my best
> suggestion is to either disable it when configuring OMPI (as described in
> https://github.com/open-mpi/ompi/issues/6640#issuecomment-490465625) or
> exclude it at runtime with "mpirun -mca btl ^uct ...".
>
> UCX reference issue: https://github.com/openucx/ucx/issues/3581
>
> --Yossi
>
> From: Dave Turner <drdavetur...@gmail.com>
> Sent: Monday, April 22, 2019 10:26 PM
> To: Yossi Itigin <yos...@mellanox.com>
> Cc: Pavel Shamis <pasharesea...@gmail.com>; devel@lists.open-mpi.org;
>     Sergey Oblomov <serg...@mellanox.com>
> Subject: Re: [OMPI devel] Seeing message failures in OpenMPI 4.0.1 on UCX
>
> Yossi,
>
>     I reran the base NetPIPE test, then added --mca
> opal_common_ucx_opal_mem_hooks 1 as you suggested, but got the same
> failures and no warning messages.  Let me know if there is anything else
> you'd like me to try.
>
>                           Dave Turner
>
> Elf22 module purge
> Elf22 /homes/daveturner/libs/openmpi-4.0.1-ucx/bin/mpirun -np 2 --hostfile
> hf.elf NPmpi-4.0.1-ucx -o np.elf.mpi-4.0.1-ucx-ib --printhostnames
> Saving output to np.elf.mpi-4.0.1-ucx-ib
>
> Proc 0 is on host elf22
>
> Proc 1 is on host elf23
>
>       Clock resolution ~   1.000 nsecs      Clock accuracy ~  33.000 nsecs
>
> Start testing with 7 trials for each message size
>   1:       1  B     24999 times -->    2.754 Mbps  in    2.904 usecs
>   2:       2  B     86077 times -->    5.536 Mbps  in    2.890 usecs
>   <... cut ...>
>  91: 196.611 KB      1465 times -->    9.222 Gbps  in  170.561 usecs
>  92: 262.141 KB      1465 times -->    9.352 Gbps  in  224.245 usecs
>  93: 262.144 KB      1114 times -->    9.246 Gbps  in  226.826 usecs
>  94: 262.147 KB      1102 times -->    9.243 Gbps  in  226.883 usecs
>  95: 393.213 KB      1101 times -->    9.413 Gbps  in  334.177 usecs   1 failures
>  96: 393.216 KB       748 times -->    9.418 Gbps  in  334.005 usecs  10472 failures
>  97: 393.219 KB       748 times -->    9.413 Gbps  in  334.201 usecs   1 failures
>  98: 524.285 KB       748 times -->    9.498 Gbps  in  441.601 usecs   1 failures
>    <... cut ...>
> 120:   6.291 MB        48 times -->    9.744 Gbps  in    5.166 msecs   672 failures
> 121:   6.291 MB        48 times -->    9.736 Gbps  in    5.170 msecs   1 failures
> 122:   8.389 MB        48 times -->    9.744 Gbps  in    6.887 msecs   1 failures
> 123:   8.389 MB        36 times -->    9.750 Gbps  in    6.883 msecs   504 failures
> 124:   8.389 MB        36 times -->    9.739 Gbps  in    6.891 msecs   1 failures
>
> Completed with        max bandwidth    9.737 Gbps        2.908 usecs latency
>
>