Hello Martin,

It is odd that I already had "PMIX_MCA_gds=hash" and still got the
problem on a beowolf with multiple boxes connected by ethernet.

Since the lack of appropriate "#ifdefs" seems like an oversight on the
part of the openmpi developers, I think it would be appropriate to
push it upstream.  Are you prepared to do that?  I could try, but it
would take me a while to educate myself on this.

Best,

Dave

On 12/19/19, Martin Reindl <[email protected]> wrote:
> [moved to ports@]
>
> On Tue, Dec 17, 2019 at 04:16:25PM -0700, Raymond, David wrote:
>> Martin,
>>
>> I have been using openmpi 4.0.2 on my computer system and I found a
>> bug that is provoked by running a job (a Go program interfaced to the
>> Clang MPI package) on multiple machines connected by ethernet.  This
>> crashes the program with the following output:
> [...]
>>
>> I traced this to the fact that OpenBSD's version of pthreads doesn't
>> have "pthread_mutexattr_setpshared".  It turns out that the
>> configuration file undefines a flag if this is so, but the actual code
>> doesn't pay any attention to this.  I fixed the problem by putting
>> appropriate ifdefs around the code generating the error, which itself
>> is simple error checking code.  This seems to work.  I have attached
>> two patches for the 4.0.2 source.
>
> Hello Dave,
>
> Thanks for your input, I've updated the 4.0.2 diff.
>
> We already were aware of the problem with 4.0.1 back in June and worked
> around the problem by setting PMIX_MCA_gds=hash before execution to avoid
> GDS/ds21 and GDS/12.
>
> Your diff is of course a much better way, do you want to try to push it
> upstream?
>
> -m
>
> Index: Makefile
> ===================================================================
> RCS file: /cvs/ports/devel/openmpi/Makefile,v
> retrieving revision 1.28
> diff -u -p -u -p -r1.28 Makefile
> --- Makefile  28 Jun 2019 11:05:11 -0000      1.28
> +++ Makefile  19 Dec 2019 07:18:30 -0000
> @@ -2,9 +2,8 @@
>
>  COMMENT =            open source MPI-3.1 implementation
>
> -V =                  4.0.1
> +V =                  4.0.2
>  DISTNAME =           openmpi-$V
> -REVISION =           0
>
>  SHARED_LIBS +=  mca_common_dstore         0.0 # 1.0
>  SHARED_LIBS +=  mca_common_monitoring     0.0 # 60.0
> Index: distinfo
> ===================================================================
> RCS file: /cvs/ports/devel/openmpi/distinfo,v
> retrieving revision 1.4
> diff -u -p -u -p -r1.4 distinfo
> --- distinfo  27 Jun 2019 13:52:00 -0000      1.4
> +++ distinfo  19 Dec 2019 07:18:30 -0000
> @@ -1,2 +1,2 @@
> -SHA256 (openmpi-4.0.1.tar.gz) =
> 5V4hP+CaIUq58scirP2L97ObvBgA5LekZNON8V5wf1k=
> -SIZE (openmpi-4.0.1.tar.gz) = 17513706
> +SHA256 (openmpi-4.0.2.tar.gz) =
> ZigFhw6GoUceWXObDDTG+QBODHoi2waFYtU4jsRCGQQ=
> +SIZE (openmpi-4.0.2.tar.gz) = 17373487
> Index:
> patches/patch-opal_mca_pmix_pmix3x_pmix_src_mca_gds_ds12_gds_ds12_lock_pthread_c
> ===================================================================
> RCS file:
> patches/patch-opal_mca_pmix_pmix3x_pmix_src_mca_gds_ds12_gds_ds12_lock_pthread_c
> diff -N
> patches/patch-opal_mca_pmix_pmix3x_pmix_src_mca_gds_ds12_gds_ds12_lock_pthread_c
> --- /dev/null 1 Jan 1970 00:00:00 -0000
> +++
> patches/patch-opal_mca_pmix_pmix3x_pmix_src_mca_gds_ds12_gds_ds12_lock_pthread_c
>       19
> Dec 2019 07:18:30 -0000
> @@ -0,0 +1,20 @@
> +$OpenBSD$
> +
> +Index: opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds12/gds_ds12_lock_pthread.c
> +---
> opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds12/gds_ds12_lock_pthread.c.orig
> ++++ opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds12/gds_ds12_lock_pthread.c
> +@@ -132,12 +132,14 @@ pmix_status_t
> pmix_gds_ds12_lock_init(pmix_common_dsto
> +             PMIX_ERROR_LOG(rc);
> +             goto error;
> +         }
> ++#ifdef HAVE_PTHREAD_SHARED
> +         if (0 != pthread_rwlockattr_setpshared(&attr,
> PTHREAD_PROCESS_SHARED)) {
> +             pthread_rwlockattr_destroy(&attr);
> +             rc = PMIX_ERR_INIT;
> +             PMIX_ERROR_LOG(rc);
> +             goto error;
> +         }
> ++#endif
> + #ifdef HAVE_PTHREAD_SETKIND
> +         if (0 != pthread_rwlockattr_setkind_np(&attr,
> +
> PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP)) {
> Index:
> patches/patch-opal_mca_pmix_pmix3x_pmix_src_mca_gds_ds21_gds_ds21_lock_pthread_c
> ===================================================================
> RCS file:
> patches/patch-opal_mca_pmix_pmix3x_pmix_src_mca_gds_ds21_gds_ds21_lock_pthread_c
> diff -N
> patches/patch-opal_mca_pmix_pmix3x_pmix_src_mca_gds_ds21_gds_ds21_lock_pthread_c
> --- /dev/null 1 Jan 1970 00:00:00 -0000
> +++
> patches/patch-opal_mca_pmix_pmix3x_pmix_src_mca_gds_ds21_gds_ds21_lock_pthread_c
>       19
> Dec 2019 07:18:30 -0000
> @@ -0,0 +1,21 @@
> +$OpenBSD$
> +
> +Index: opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds21/gds_ds21_lock_pthread.c
> +---
> opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds21/gds_ds21_lock_pthread.c.orig
> ++++ opal/mca/pmix/pmix3x/pmix/src/mca/gds/ds21/gds_ds21_lock_pthread.c
> +@@ -182,12 +182,15 @@ pmix_status_t
> pmix_gds_ds21_lock_init(pmix_common_dsto
> +             PMIX_ERROR_LOG(rc);
> +             goto error;
> +         }
> ++
> ++#ifdef HAVE_PTHREAD_MUTEXATTR_SETPSHARED
> +         if (0 != pthread_mutexattr_setpshared(&attr,
> PTHREAD_PROCESS_SHARED)) {
> +             pthread_mutexattr_destroy(&attr);
> +             rc = PMIX_ERR_INIT;
> +             PMIX_ERROR_LOG(rc);
> +             goto error;
> +         }
> ++#endif
> +
> +         segment_hdr_t *seg_hdr =
> (segment_hdr_t*)lock_item->seg_desc->seg_info.seg_base_addr;
> +         seg_hdr->num_locks = local_size;
> Index: pkg/PLIST
> ===================================================================
> RCS file: /cvs/ports/devel/openmpi/pkg/PLIST,v
> retrieving revision 1.5
> diff -u -p -u -p -r1.5 PLIST
> --- pkg/PLIST 27 Jun 2019 13:52:00 -0000      1.5
> +++ pkg/PLIST 19 Dec 2019 07:18:30 -0000
> @@ -143,15 +143,6 @@ lib/openmpi/mca_compress_gzip.so
>  lib/openmpi/mca_crs_none.a
>  lib/openmpi/mca_crs_none.la
>  lib/openmpi/mca_crs_none.so
> -lib/openmpi/mca_dfs_app.a
> -lib/openmpi/mca_dfs_app.la
> -lib/openmpi/mca_dfs_app.so
> -lib/openmpi/mca_dfs_orted.a
> -lib/openmpi/mca_dfs_orted.la
> -lib/openmpi/mca_dfs_orted.so
> -lib/openmpi/mca_dfs_test.a
> -lib/openmpi/mca_dfs_test.la
> -lib/openmpi/mca_dfs_test.so
>  lib/openmpi/mca_errmgr_default_app.a
>  lib/openmpi/mca_errmgr_default_app.la
>  lib/openmpi/mca_errmgr_default_app.so
> @@ -221,9 +212,6 @@ lib/openmpi/mca_iof_tool.so
>  lib/openmpi/mca_mpool_hugepage.a
>  lib/openmpi/mca_mpool_hugepage.la
>  lib/openmpi/mca_mpool_hugepage.so
> -lib/openmpi/mca_notifier_syslog.a
> -lib/openmpi/mca_notifier_syslog.la
> -lib/openmpi/mca_notifier_syslog.so
>  lib/openmpi/mca_odls_default.a
>  lib/openmpi/mca_odls_default.la
>  lib/openmpi/mca_odls_default.so
> @@ -288,6 +276,9 @@ lib/openmpi/mca_reachable_weighted.so
>  lib/openmpi/mca_regx_fwd.a
>  lib/openmpi/mca_regx_fwd.la
>  lib/openmpi/mca_regx_fwd.so
> +lib/openmpi/mca_regx_naive.a
> +lib/openmpi/mca_regx_naive.la
> +lib/openmpi/mca_regx_naive.so
>  lib/openmpi/mca_regx_reverse.a
>  lib/openmpi/mca_regx_reverse.la
>  lib/openmpi/mca_regx_reverse.so
> @@ -315,9 +306,6 @@ lib/openmpi/mca_rml_oob.so
>  lib/openmpi/mca_routed_binomial.a
>  lib/openmpi/mca_routed_binomial.la
>  lib/openmpi/mca_routed_binomial.so
> -lib/openmpi/mca_routed_debruijn.a
> -lib/openmpi/mca_routed_debruijn.la
> -lib/openmpi/mca_routed_debruijn.so
>  lib/openmpi/mca_routed_direct.a
>  lib/openmpi/mca_routed_direct.la
>  lib/openmpi/mca_routed_direct.so
>


-- 
David J. Raymond
[email protected]
http://physics.nmt.edu/~raymond

Reply via email to