Re: [OMPI devel] [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:12 ./spawn_master

this does not work

mpirun -np 1 --host motomachi ./spawn_master
# not enough slots available, aborts with a user friendly error message

mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi ./spawn_master
# various errors: sm_segment_attach() fails, a task crashes, and this
# ends up with the following error message

At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is an
error; Open MPI requires that all MPI processes be able to reach each
other.  This error can sometimes be the result of forgetting to specify
the "self" BTL.

  Process 1 ([[15519,2],0]) is on host: motomachi
  Process 2 ([[15519,2],1]) is on host: unknown!
  BTLs attempted: self tcp

mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:1 ./spawn_master
# same error as above

mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:2 ./spawn_master
# same error as above

for the record, the following command surprisingly works

mpirun -np 1 --slot-list 0:0-5,1:0-5 --host motomachi:3 --mca btl tcp,self ./spawn_master

bottom line, my guess is that when the user specifies the --slot-list
and the --host options *and* there are no default slot numbers for the
hosts, we should default to using the number of slots from the slot
list (e.g. in this case, default to --host motomachi:12 instead of,
i guess, --host motomachi:1).

/* fwiw, i made https://github.com/open-mpi/ompi/pull/2715
   but these are not the root cause */

Cheers,

Gilles

-------- Forwarded Message --------
Subject: Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux
Date: Wed, 11 Jan 2017 20:39:02 +0900
From: Gilles Gouaillardet <gilles.gouaillar...@gmail.com>
Reply-To: Open MPI Users <us...@lists.open-mpi.org>
To: Open MPI Users <us...@lists.open-mpi.org>

Siegmar,

Your slot list is correct. An invalid slot list for your node would be
0:1-7,1:0-7 /* and since the test requires only 5 tasks, that could
even work with such an invalid list. My vm is single socket with 4
cores, so a 0:0-4 slot list results in an unfriendly pmix error */

Bottom line, your test is correct, and there is a bug in v2.0.x that I
will investigate from tomorrow.

Cheers,

Gilles

On Wednesday, January 11, 2017, Siegmar Gross
<siegmar.gr...@informatik.hs-fulda.de> wrote:

Hi Gilles,

thank you very much for your help. What does "incorrect slot list"
mean? My machine has two 6-core processors, so I specified
"--slot-list 0:0-5,1:0-5". Does "incorrect" mean that it isn't allowed
to specify more slots than available, to specify fewer slots than
available, or to specify more slots than needed for the processes?

Kind regards

Siegmar

Am 11.01.2017 um 10:04 schrieb Gilles Gouaillardet:

Siegmar,

I was able to reproduce the issue on my vm (no need for a real
heterogeneous cluster here). I will keep digging tomorrow.

Note that if you specify an incorrect slot list, MPI_Comm_spawn fails
with a very unfriendly error message.
Right now, the 4th spawned task crashes, so this is a different issue.

Cheers,

Gilles

r...@open-mpi.org wrote:

I think there is some relevant discussion here:
https://github.com/open-mpi/ompi/issues/1569

It looks like Gilles had (at least at one point) a fix for master when
enable-heterogeneous, but I don't know if that was committed.

On Jan 9, 2017, at 8:23 AM, Howard Pritchard <hpprit...@gmail.com> wrote:

Hi Siegmar,

You have some config parameters I wasn't trying that may have some
impact. I'll give it a try with these parameters. This should be
enough info for now.

Thanks,

Howard

2017-01-09 0:59 GMT-07:00 Siegmar Gross
<siegmar.gr...@informatik.hs-fulda.de>:

Hi Howard,

I use the following commands to build and insta
Re: [OMPI devel] v2.0.1 PRs: open season
Hi Jeff,

I didn't find the PR for my problem in your list and I'm waiting for a
solution.

https://github.com/open-mpi/ompi/issues/1573

Kind regards and thank you very much for any help in advance

Siegmar

Am 15.07.2016 um 16:15 schrieb Jeff Squyres (jsquyres):
> v2.0.1 is officially open to accept PRs.
>
> Please note that many v2.0.1 PRs still need reviews:
>
> - 36 open v2.0.1 PRs
> - only 13 have reviews
>
> Please start getting reviews for your v2.0.1 PRs -- no review, no merge:
>
> https://github.com/open-mpi/ompi-release/pulls?utf8=%E2%9C%93=is%3Apr%20is%3Aopen%20milestone%3Av2.0.1
>
> Also, some of the PRs are a little old -- I just kicked off CI on PRs
> that hadn't had a CI run in the past week (although the Mellanox
> Jenkins looks like it might be failing tests due to a local issue --
> hopefully we can get that fixed up shortly).
Re: [OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris
Hi,

at first thank you very much for your help.

1st patch:
> Can you apply the following patch to a trunk tarball and see if it
> works for you?

2nd patch:
> Found the problem. Was accessing a boolean variable using intval. That
> is a bug that has gone unnoticed on all platforms but thankfully
> Solaris caught it.
>
> Please try the attached patch.

I applied both patches manually to openmpi-1.9a1r29972, because my
patch program couldn't use the patches. Unfortunately I still get a bus
error. Hopefully I didn't make a mistake applying your patches;
therefore I show you a "diff" for my files. By the way, I tried to
apply your patches with "patch -b -i ". Is it necessary to use a
different command?

tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
-rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
-rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
1685,1689c1685
< ...mbv_type) {
<     mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, );
< } else {
<     mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, );
< }
---
> ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, );
tyr openmpi-1.9a1r29972 163

tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
-rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
-rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
267,271c267,268
< struct sockaddr_in inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't ...
---
> const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
> const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
274,275c271,272
< if((inaddr1.sin_addr.s_addr & netmask) ==
<    (inaddr2.sin_addr.s_addr & netmask)) {
---
> if((inaddr1->sin_addr.s_addr & netmask) ==
>    (inaddr2->sin_addr.s_addr & netmask)) {
284,290c281,284
< struct sockaddr_in6 inaddr1, inaddr2;
< /* Use temporary variables and memcpy's so that we don't ...
---
> const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
> const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
> struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
> struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
tyr openmpi-1.9a1r29972 167

Now my debug information.

tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
Reading ompi_info
Reading ld.so.1
Reading libmpi.so.0.0.0
Reading libopen-rte.so.0.0.0
Reading libopen-pal.so.0.0.0
Reading libsendfile.so.1
Reading libpicl.so.1
Reading libkstat.so.1
Reading liblgrp.so.1
Reading libsocket.so.1
Reading libnsl.so.1
Reading librt.so.1
Reading libm.so.2
Reading libthread.so.1
Reading libc.so.1
Reading libdoor.so.1
Reading libaio.so.1
Reading libmd.so.1
(dbx) run -a
Running: ompi_info -a
(process id 10998)
Reading libc_psr.so.1
...
MCA compress: parameter "compress_base_verbose" (current value: "-1",
              data source: default, level: 8 dev/detail, type: int)
              Verbosity level for the compress framework (0 = no verbosity)
t@1 (l@1) signal BUS (invalid address alignment) in var_value_string at line 1680 in file "mca_base_var.c"
 1680       ret = asprintf (value_string, var_type_formats[var->mbv_type], value[0]);
(dbx) check -all
dbx: warning: check -all will be turned on in the next run of the process
access checking - OFF
memuse checking - OFF
(dbx) run -a
Running: ompi_info -a
(process id 11000)
Reading rtcapihook.so
Reading libdl.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading rtcboot.so
Reading librtc.so
Reading libmd_psr.so.1
RTC: Enabling Error Checking...
RTC: Using UltraSparc trap mechanism
RTC: See `help rtc showmap' and `help rtc limitations' for details.
RTC: Running program...
Read from uninitialized (rui) on thread 1:
Attempting to read 4 bytes at address 0x7fffd5f8
    which is 184 bytes above the current stack pointer
Variable is 'index'
t@1 (l@1) stopped in
[OMPI devel] Bus error with openmpi-1.7.4rc1 on Solaris
Hi,

today I installed openmpi-1.7.4rc1 on Solaris 10 Sparc with Sun C 5.12.
Unfortunately my problems with bus errors, which I reported on December
4th for openmpi-1.7.4a1r29784 at us...@open-mpi.org, are not solved
yet. Does somebody have time to look into the matter, or is Solaris
support abandoned so that I have to stay with openmpi-1.6.x in the
future?

Thank you very much for any help in advance.

Kind regards

Siegmar
Re: [OMPI devel] [OMPI users] Error in openmpi-1.9a1r29179
Hello Josh,

thank you very much for your help. Unfortunately I still have a problem
building Open MPI.

> I pushed a bunch of fixes, can you please try now.

I tried to build openmpi-1.9a1r29197 on my platforms and now I get the
following error on all of them.

linpc1 openmpi-1.9a1r29197-Linux.x86_64.64_cc 117 tail -22 log.make.Linux.x86_64.64_cc
  CC       base/memheap_base_alloc.lo
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/amd64/atomic.h", line 136: warning: parameter in inline asm statement unused: %3
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/amd64/atomic.h", line 182: warning: parameter in inline asm statement unused: %2
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/amd64/atomic.h", line 203: warning: parameter in inline asm statement unused: %2
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/amd64/atomic.h", line 224: warning: parameter in inline asm statement unused: %2
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/amd64/atomic.h", line 245: warning: parameter in inline asm statement unused: %2
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/atomic_impl.h", line 167: warning: statement not reached
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/atomic_impl.h", line 192: warning: statement not reached
"../../../../openmpi-1.9a1r29197/opal/include/opal/sys/atomic_impl.h", line 217: warning: statement not reached
"../../../../openmpi-1.9a1r29197/oshmem/mca/spml/spml.h", line 76: warning: anonymous union declaration
"../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c", line 112: warning: argument mismatch
"../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c", line 119: warning: argument mismatch
"../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c", line 124: warning: argument mismatch
"../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c", line 248: warning: pointer to void or function used in arithmetic
"../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c", line 286: syntax error before or at: |
"../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c", line 300: warning: pointer to void or function used in arithmetic
cc: acomp failed for ../../../../openmpi-1.9a1r29197/oshmem/mca/memheap/base/memheap_base_alloc.c
make[2]: *** [base/memheap_base_alloc.lo] Error 1
make[2]: Leaving directory `/export2/src/openmpi-1.9/openmpi-1.9a1r29197-Linux.x86_64.64_cc/oshmem/mca/memheap'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/export2/src/openmpi-1.9/openmpi-1.9a1r29197-Linux.x86_64.64_cc/oshmem'
make: *** [all-recursive] Error 1

Kind regards

Siegmar

> -----Original Message-----
> From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> Sent: Tuesday, September 17, 2013 6:37 AM
> To: Siegmar Gross; Open MPI Developers List
> Cc: Joshua Ladd
> Subject: Re: [OMPI users] Error in openmpi-1.9a1r29179
>
> ...moving over to the devel list...
>
> Dave and I looked at this during a break in the EuroMPI conference,
> and noticed several things:
>
> 1. Some of the shmem interfaces are functions (i.e., return non-void)
> and some are subroutines (i.e., return void). They're currently all
> using a single macro to declare the interfaces, which assumes
> functions. So this macro is incorrect for subroutines -- you really
> need 2 macros.
>
> 2. The macro name is OMPI_GENERATE_FORTRAN_BINDINGS -- why isn't it
> SHMEM_GENERATE_FORTRAN_BINDINGS?
>
> 3. I notice that none of the Fortran interfaces are prototyped in
> shmem.fh. Why not? A shmem person here in Madrid mentioned that there
> is supposed to be a shmem.fh file and a shmem modulefile.
>
> On Sep 17, 2013, at 8:49 AM, Siegmar Gross
> <siegmar.gr...@informatik.hs-fulda.de> wrote:
>
> > Hi,
> >
> > I tried to install openmpi-1.9a1r29179 on "openSuSE Linux 12.1",
> > "Solaris 10 x86_64", and "Solaris 10 sparc" with "Sun C 5.12" in
> > 64-bit mode. Unfortunately "make" breaks with the same error on all
> > platforms.
> >
> > tail -15 log.make.Linux.x86_64.64_cc
> >
> >   CCLD     libshmem_c.la
> > make[3]: Leaving directory
> > `/export2/src/openmpi-1.9/openmpi-1.9a1r29179-Linux.x86_64.64_cc/oshmem/shmem/c'
> > make[2]: Leaving directory
> > `/export2/src/openmpi-1.9/openmpi-1.9a1r29179-Linux.x86_64.64_cc/oshmem/shmem/c'
> > Making all in shmem/fortran
> > make[2]: Entering directory
> > `/export2/src/openmpi-1.9/openmpi-1.9a1r29179-Linux.x86_64.64_cc/oshmem/shmem/fortran'
> >   CC       start
Re: [OMPI devel] v1.7.0rc7
Hi,

> This release candidate is the last one we expect to have before
> release, so please test it. Can be downloaded from the usual place:
>
> http://www.open-mpi.org/software/ompi/v1.7/
>
> Latest changes include:
>
> * update of the alps/lustre configure code
> * fixed solaris hwloc code
> * various mxm updates
> * removed java bindings (delayed until later release)
> * improved the --report-bindings output
> * a variety of minor cleanups

My rankfiles don't work.

tyr rankfiles 106 ompi_info | grep "MPI:"
                Open MPI: 1.7rc7
tyr rankfiles 107 mpiexec -report-bindings -rf rf_ex_linpc hostname
--
All nodes which are allocated for this job are already filled.
--
tyr rankfiles 108 mpiexec -report-bindings -rf rf_ex_sunpc hostname
--
All nodes which are allocated for this job are already filled.
--
tyr rankfiles 109 mpiexec -report-bindings -rf rf_ex_sunpc_linpc hostname
--
All nodes which are allocated for this job are already filled.
--
tyr rankfiles 110

They work as expected with openmpi-1.6.4.

tyr rankfiles 99 ompi_info | grep "MPI:"
                Open MPI: 1.6.4rc4r28039
tyr rankfiles 100 mpiexec -report-bindings -rf rf_ex_linpc hostname
[linpc0:17655] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
linpc0
linpc1
[linpc1:06707] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
linpc1
[linpc1:06707] MCW rank 2 bound to socket 1[core 0]: [. .][B .] (slot list 1:0)
[linpc1:06707] MCW rank 3 bound to socket 1[core 1]: [. .][. B] (slot list 1:1)
linpc1
tyr rankfiles 101 mpiexec -report-bindings -rf rf_ex_sunpc hostname
[sunpc0:22706] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
sunpc0
sunpc1
[sunpc1:25189] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
sunpc1
[sunpc1:25189] MCW rank 2 bound to socket 1[core 0]: [. .][B .] (slot list 1:0)
[sunpc1:25189] MCW rank 3 bound to socket 1[core 1]: [. .][. B] (slot list 1:1)
sunpc1
tyr rankfiles 102 mpiexec -report-bindings -rf rf_ex_sunpc_linpc hostname
[linpc1:06777] MCW rank 0 bound to socket 0[core 0-1] socket 1[core 0-1]: [B B][B B] (slot list 0:0-1,1:0-1)
linpc1
sunpc1
[sunpc1:25226] MCW rank 1 bound to socket 0[core 0-1]: [B B][. .] (slot list 0:0-1)
sunpc1
[sunpc1:25226] MCW rank 2 bound to socket 1[core 0]: [. .][B .] (slot list 1:0)
[sunpc1:25226] MCW rank 3 bound to socket 1[core 1]: [. .][. B] (slot list 1:1)
sunpc1
tyr rankfiles 103

Kind regards

Siegmar
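For reference, a rankfile that would produce slot lists like those in the 1.6.4 output above might look like this. The host names are taken from that output; this is a hypothetical reconstruction, not the contents of Siegmar's actual rf_ex_linpc.

```
rank 0=linpc0 slot=0:0-1,1:0-1
rank 1=linpc1 slot=0:0-1
rank 2=linpc1 slot=1:0
rank 3=linpc1 slot=1:1
```

Each line binds one MPI_COMM_WORLD rank to a host and a socket:core slot list, which -report-bindings then echoes back as the "[B B][. .]" maps shown above.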
Re: [OMPI devel] RFC: Remove (broken) heterogeneous support
Hi,

> WHAT: Remove the configure command line option to enable heterogeneous
> support
>
> WHY: The heterogeneous conversion code isn't working; very few people
> use this feature
>
> WHERE: README and config/opal_configure_options.m4. See attached patch.
>
> TIMEOUT: Next Tuesday teleconf, 5 Feb 2013
>
> MORE DETAIL:
>
> The heterogeneous code has been broken for a while. The assumption is
> that this is a minor bug that can fairly easily be fixed, but a) no
> one has taken the time to do so, b) very few people use this
> functionality, and c) many OMPI developers don't even have hardware
> where to test this scenario (e.g., big and little endian systems).
>
> As such, a suggestion was made to remove the --enable-heterogeneous
> configure CLI switch so that users don't try to enable it. If someone
> ever fixes the heterogeneous code, the configure CLI switch can be
> put back.

I have no problem with the --enable-heterogeneous option when I build
Open MPI, but Open MPI will not work in a heterogeneous environment
with little and big endian machines, while LAM MPI can handle such
environments. You wanted to solve this problem.

https://svn.open-mpi.org/trac/ompi/ticket/3430

I would appreciate it if you wouldn't remove this option.

Kind regards

Siegmar