Thanks Brice, I will open the PRs for the various ompi branches starting tomorrow.
Cheers,

Gilles

Brice Goglin <brice.gog...@inria.fr> wrote:
>Sorry, I didn't see this report before the pull request.
>
>I applied Gilles' "simple but arguable" fix to master and stable branches up
>to v1.9. It could be too imperfect if somebody ever changes the permissions of
>/devices/pci* but I guess that's not going to happen in practice. Finding the
>right device path and checking permissions inside hwloc looks more arguable to
>me.
>Thanks!
>
>I am adding a filter to my email client to avoid missing hwloc-related things
>among OMPI mails.
>
>Brice
>
>On 28/09/2015 06:23, Gilles Gouaillardet wrote:
>
>Paul and Brice,
>
>the error message is displayed by libpciaccess when hwloc invokes
>pci_system_init
>
>on Solaris :
>crw------- 1 root sys 182, 253 Sep 28 10:55 /devices/pci@0,0:reg
>
>from libpciaccess
>
>    snprintf(nexus_path, sizeof(nexus_path), "/devices%s", nexus_name);
>    if ((fd = open(nexus_path, O_RDWR | O_CLOEXEC)) >= 0) {
>[...]
>    } else {
>        (void) fprintf(stderr, "Error opening %s: %s\n",
>                       nexus_path, strerror(errno));
>[...]
>    }
>
>i noted some TODO comments in the code to handle this.
>since this piece of code is deep inside libpciaccess, i guess a fix is not
>trivial.
>unless libpciaccess is modified (for example, do not fprintf if a given
>environment variable is set),
>hwloc should "emulate" pieces of libpciaccess to get the devices path, check
>the permissions and
>invoke pci_system_init only if everything is ok.
>
>another simpler (but arguable ...) option is not to probe the PCI bus on
>Solaris unless root
>i made PR #136 https://github.com/open-mpi/hwloc/pull/136 to implement this
>
>Cheers,
>
>Gilles
>
>On 9/26/2015 9:24 AM, Paul Hargrove wrote:
>
>FYI:
>
>Things look fine today with last night's master tarball.
>
>I hope Brice has a way to eliminate the hwloc warning, since I am sure I am
>not the only one with scripts that will notice "Error" in the output.
>
>-Paul
>
>On Wed, Sep 23, 2015 at 6:08 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>Aha! Thanks - just what the doctor ordered!
>
>On Sep 23, 2015, at 5:45 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
>Ralph,
>
>the root cause is that
>getsockopt(..., SOL_SOCKET, SO_RCVTIMEO,...)
>fails with errno ENOPROTOOPT on Solaris 11.2
>
>the attached patch is a proof of concept and works for me :
>/* if ENOPROTOOPT, do not try to set and restore SO_RCVTIMEO */
>
>Cheers,
>
>Gilles
>
>On 9/21/2015 2:16 PM, Paul Hargrove wrote:
>
>Ralph,
>
>Just as you say:
>The first 64s pause was before the hwloc error message appeared.
>The second was after the second server_setup_fork appears, and before whatever
>line came after that.
>
>I don't know if stdio buffering may be "distorting" the placement of the pause
>relative to the lines of output.
>However, prior to your patch the entire failed mpirun was around 1s.
>
>No allocation.
>
>No resource manager.
>Just a single workstation.
>
>-Paul
>
>On Sun, Sep 20, 2015 at 9:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>?? Just so this old fossilized brain gets this right: you are saying there was
>a 64s pause before the hwloc error appeared, and then another 64s pause after
>the second server_setup_fork message appeared?
>
>If that’s true, then I’m chasing the wrong problem - it sounds like something
>is messed up in the mpirun startup. Did you have more than one node in the
>allocation by chance? I’m wondering if we are getting held up by something in
>the daemon launch/callback area.
>
>On Sep 20, 2015, at 4:08 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>Ralph,
>
>Still failing with that patch, but with the addition of a fairly long pause
>(64s) before the first error message appears, and again after the second
>"server setup_fork" (64s again)
>
>New output is attached.
>
>-Paul
>
>On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>Argh - found a typo in the output line. Could you please try the attached
>patch and do it again? This might fix it, but if not it will provide me with
>some idea of the returned error.
>
>Thanks
>
>Ralph
>
>On Sep 20, 2015, at 12:40 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>Yes, it is definitely at 10.
>
>Another attempt is attached.
>
>-Paul
>
>On Sun, Sep 20, 2015 at 8:19 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>Paul - can you please confirm that you gave mpirun a level of 10 for the
>pmix_base_verbose param? This output isn’t what I would have expected from
>that level - it looks more like the verbosity was set to 5, and so the error
>number isn’t printed.
>
>Thanks
>
>Ralph
>
>On Sep 20, 2015, at 3:42 AM, Gilles Gouaillardet
><gilles.gouaillar...@gmail.com> wrote:
>
>Paul,
>
>I do not remember it like that ...
>
>at that time, the issue in ompi was that the global errno was used instead of
>the per thread errno.
>
>though the man page says -mt should be used for multithreaded apps, you
>tried -D_REENTRANT on all your platforms, and it was enough to get the
>expected result.
>
>I just wanted to check pmix1xx (sub)configure did correctly pass the
>-D_REENTRANT flag, and it does. so this is very likely a new and unrelated
>error
>
>Cheers,
>
>Gilles
>
>On Sunday, September 20, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>Gilles,
>
>Yes every $CC invocation in opal/mca/pmix/pmix1xx includes "-D_REENTRANT".
>
>However, they don't include "-mt".
>
>I believe we concluded (when we had problems previously) that "-mt" was the
>proper flag (at compile and link) for multi-threaded with the Studio compilers.
>
>-Paul
>
>On Sat, Sep 19, 2015 at 11:29 PM, Gilles Gouaillardet
><gilles.gouaillar...@gmail.com> wrote:
>
>Paul,
>
>Can you please double check pmix1xx is compiled with -D_REENTRANT ?
>
>We ran into similar issues in the past, and they only occurred with Solaris
>
>Cheers,
>
>Gilles
>
>On Sunday, September 20, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>Ralph,
>
>The output from the requested run is attached.
>
>-Paul
>
>On Sat, Sep 19, 2015 at 9:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>Ah, okay - that makes more sense. I’ll have to let Brice see if he can figure
>out how to silence the hwloc error message as I can’t find where it came from.
>The other errors are real and are the reason why the job was terminated.
>
>The problem is that we are trying to establish a communication between the app
>and the daemon via unix domain socket, and we failed to do so. The error tells
>me that we were able to create and connect to the socket, but failed when the
>daemon tried to do a blocking send to the app.
>
>Can you rerun it with -mca pmix_base_verbose 10? It will tell us the value of
>the error number that was returned
>
>Thanks
>
>Ralph
>
>On Sep 19, 2015, at 9:37 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>Ralph,
>
>No it did not run.
>
>The complete output (which I really should have included in the first place)
>is below.
>
>-Paul
>
>$ mpirun -mca btl sm,self -np 2 examples/ring_c
>
>Error opening /devices/pci@0,0:reg: Permission denied
>
>[pcp-d-3:26054] PMIX ERROR: ERROR in file
>/export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
> at line 181
>
>[pcp-d-3:26053] PMIX ERROR: UNREACHABLE in file
>/export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
> at line 463
>
>--------------------------------------------------------------------------
>It looks like MPI_INIT failed for some reason; your parallel process is
>likely to abort.
There are many reasons that a parallel process can
>fail during MPI_INIT; some of which are due to configuration or environment
>problems. This failure appears to be an internal failure; here's some
>additional information (which may only be relevant to an Open MPI
>developer):
>
>  ompi_mpi_init: ompi_rte_init failed
>  --> Returned "(null)" (-43) instead of "Success" (0)
>--------------------------------------------------------------------------
>*** An error occurred in MPI_Init
>*** on a NULL communicator
>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>*** and potentially your MPI job)
>[pcp-d-3:26054] Local abort before MPI_INIT completed completed successfully,
>but am not able to aggregate error messages, and not able to guarantee that
>all other processes were killed!
>-------------------------------------------------------
>Primary job terminated normally, but 1 process returned
>a non-zero exit code.. Per user-direction, the job has been aborted.
>-------------------------------------------------------
>--------------------------------------------------------------------------
>mpirun detected that one or more processes exited with non-zero status, thus
>causing
>the job to be terminated. The first process to do so was:
>
>  Process name: [[11371,1],0]
>  Exit code: 1
>--------------------------------------------------------------------------
>
>On Sat, Sep 19, 2015 at 8:50 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>Paul, can you clarify something for me? The error in this case indicates that
>the client wasn’t able to reach the daemon - this should have resulted in
>termination of the job. Did the job actually run?
>
>On Sep 18, 2015, at 2:50 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>I'm on travel right now, but it should be an easy fix when I return.
Sorry for
>the annoyance
>
>On Thu, Sep 17, 2015 at 11:13 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>Any suggestion how I (as a non-root user) can avoid seeing this hwloc error
>message on every run?
>
>-Paul
>
>On Thu, Sep 17, 2015 at 11:00 PM, Gilles Gouaillardet <gil...@rist.or.jp>
>wrote:
>
>Paul,
>
>IIRC, the "Permission denied" is coming from hwloc that cannot collect all the
>info it would like.
>
>Cheers,
>
>Gilles
>
>On 9/18/2015 2:34 PM, Paul Hargrove wrote:
>
>Tried tonight's master tarball on Solaris 11.2 on x86-64 with the Studio
>Compilers (default ILP32 output) and saw the following result
>
>$ mpirun -mca btl sm,self -np 2 examples/ring_c
>
>Error opening /devices/pci@0,0:reg: Permission denied
>
>[pcp-d-4:00492] PMIX ERROR: ERROR in file
>/export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
> at line 181
>
>[pcp-d-4:00491] PMIX ERROR: UNREACHABLE in file
>/export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
> at line 463
>
>I don't know if the Permission denied error is related to the subsequent PMIX
>errors, but any message that says "UNREACHABLE" is clearly worth reporting.
>
>-Paul
>
>--
>Paul H.
Hargrove phhargr...@lbl.gov
>Computer Languages & Systems Software (CLaSS) Group
>Computer Science Department Tel: +1-510-495-2352
>Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post:
>http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
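
For reference, the two options discussed in this thread for avoiding the
libpciaccess "Error opening ..." message (check the permissions of the device
node before calling pci_system_init, or only probe the PCI bus when running as
root) could be sketched roughly as below. This is only an illustration under
stated assumptions, not the code from PR #136: the function names
can_open_pci_nexus and should_probe_pci_as_root_only are hypothetical, and the
caller is assumed to already know the nexus path, which in practice is
computed inside libpciaccess.

```c
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical helper, not part of hwloc or libpciaccess: returns true when
 * the calling process may open `nexus_path` for reading and writing, i.e.
 * when the O_RDWR open() quoted above would not fail on permissions. */
bool can_open_pci_nexus(const char *nexus_path)
{
    assert(nexus_path != NULL);
    /* access() checks against the real UID/GID of the process. */
    return access(nexus_path, R_OK | W_OK) == 0;
}

/* Hypothetical policy mirroring the "simple but arguable" option:
 * only probe the PCI bus when the effective user is root. */
bool should_probe_pci_as_root_only(void)
{
    return geteuid() == 0;
}
```

A caller would consult one of these before invoking pci_system_init and skip
the PCI probe when it returns false. The root-only check is simpler but, as
noted above, becomes inaccurate if the permissions of /devices/pci* are ever
changed; the access() check needs the device path, which is the part that
would have to be emulated outside libpciaccess.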