Thanks Brice,

I will open the PRs for the various ompi branches starting tomorrow.

Cheers,

Gilles

Brice Goglin <brice.gog...@inria.fr> wrote:
>Sorry, I didn't see this report before the pull request.
>
>I applied Gilles' "simple but arguable" fix to master and stable branches up 
>to v1.9. It would be imperfect if somebody ever changes the permissions of 
>/devices/pci* but I guess that's not going to happen in practice. Finding the 
>right device path and checking permissions inside hwloc looks more arguable to 
>me.
>Thanks!
>
>I am adding a filter to my email client to avoid missing hwloc-related things 
>among OMPI mails.
>
>Brice
>
>
>
>
>On 28/09/2015 06:23, Gilles Gouaillardet wrote:
>
>Paul and Brice,
>
>the error message is displayed by libpciaccess when hwloc invokes 
>pci_system_init
>
>on Solaris :
>crw-------   1 root     sys      182, 253 Sep 28 10:55 /devices/pci@0,0:reg
>
>from libpciaccess
>
>    snprintf(nexus_path, sizeof(nexus_path), "/devices%s", nexus_name);
>    if ((fd = open(nexus_path, O_RDWR | O_CLOEXEC)) >= 0) {
>[...]
>    } else {
>        (void) fprintf(stderr, "Error opening %s: %s\n",
>                       nexus_path, strerror(errno));
>[...]
>    }
>
>I noted some TODO comments in the code to handle this.
>Since this piece of code is deep inside libpciaccess, I guess a fix is not 
>trivial.
>Unless libpciaccess is modified (for example, to skip the fprintf when a given 
>environment variable is set),
>hwloc should "emulate" pieces of libpciaccess to get the device path, check 
>the permissions and
>invoke pci_system_init only if everything is OK.
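
As a rough illustration of the "check the permissions first" idea above, a caller could test whether the nexus device can be opened read-write before invoking pci_system_init. This is only a sketch under assumptions: can_open_rdwr is a hypothetical helper, not an hwloc or libpciaccess function, and it does not reproduce how libpciaccess discovers the nexus path.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper (not part of hwloc or libpciaccess).
 * Returns 1 if the calling process may open `path` for reading and
 * writing, 0 otherwise (missing file, permission denied, ...).
 * Note: access() checks against the real user ID, not the effective one. */
static int can_open_rdwr(const char *path)
{
    assert(path != NULL);
    return access(path, R_OK | W_OK) == 0;
}
```

With such a check, pci_system_init would only be called when can_open_rdwr("/devices/pci@0,0:reg") returns 1, so the "Error opening ..." message would not be printed for non-root users. The hard part discussed above, finding the right device path, is left out of this sketch.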
>
>
>Another, simpler (but arguable ...) option is not to probe the PCI bus on 
>Solaris unless running as root.
>I made PR #136 https://github.com/open-mpi/hwloc/pull/136 to implement this.
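
The root-only option can be sketched as a small policy check. The names should_probe_pci and on_solaris are hypothetical and chosen for illustration; this is not necessarily how PR #136 implements it.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical policy helper: decide whether the PCI bus should be probed.
 * On Solaris the nexus device is only readable by root, so a non-root
 * probe would just trigger the libpciaccess error message.
 * Returns 1 to probe, 0 to skip. */
static int should_probe_pci(int on_solaris, uid_t euid)
{
    if (!on_solaris)
        return 1;        /* other platforms: probe as usual */
    return euid == 0;    /* Solaris: probe only as root */
}
```

A caller would guard the probe with something like `if (should_probe_pci(on_solaris, geteuid())) pci_system_init();`. The trade-off is the one noted in this thread: the check is wrong if the permissions of /devices/pci* are ever changed.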
>
>Cheers,
>
>Gilles
>
>On 9/26/2015 9:24 AM, Paul Hargrove wrote:
>
>FYI: 
>
>
>Things look fine today with last night's master tarball.
>
>
>I hope Brice has a way to eliminate the hwloc warning, since I am sure I am 
>not the only one with scripts that will notice "Error" in the output.
>
>
>-Paul
>
>
>On Wed, Sep 23, 2015 at 6:08 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>Aha! Thanks - just what the doctor ordered! 
>
>
>
>On Sep 23, 2015, at 5:45 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
>
>Ralph,
>
>the root cause is that
>getsockopt(..., SOL_SOCKET, SO_RCVTIMEO, ...)
>fails with errno ENOPROTOOPT on Solaris 11.2
>
>the attached patch is a proof of concept and works for me :
>/* if ENOPROTOOPT, do not try to set and restore SO_RCVTIMEO */
>
>Cheers,
>
>Gilles
>
>On 9/21/2015 2:16 PM, Paul Hargrove wrote:
>
>Ralph,
>
>Just as you say:
>The first 64s pause was before the hwloc error message appeared.
>The second was after the second server_setup_fork appears, and before whatever 
>line came after that.
>
>I don't know if stdio buffering may be "distorting" the placement of the pause 
>relative to the lines of output.
>However, prior to your patch the entire failed mpirun was around 1s.
>
>No allocation. 
>
>No resource manager.
>Just a single workstation.
>
>-Paul
>
>
>On Sun, Sep 20, 2015 at 9:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>?? Just so this old fossilized brain gets this right: you are saying there was 
>a 64s pause before the hwloc error appeared, and then another 64s pause after 
>the second server_setup_fork message appeared? 
>
>
>If that’s true, then I’m chasing the wrong problem - it sounds like something 
>is messed up in the mpirun startup. Did you have more than one node in the 
>allocation by chance? I’m wondering if we are getting held up by something in 
>the daemon launch/callback area.
>
>
>
>
>On Sep 20, 2015, at 4:08 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>
>Ralph, 
>
>
>Still failing with that patch, but with the addition of a fairly long pause 
>(64s) before the first error message appears, and again after the second 
>"server setup_fork" (64s again)
>
>
>New output is attached.
>
>
>-Paul
>
>
>On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>Argh - found a typo in the output line. Could you please try the attached 
>patch and do it again? This might fix it, but if not it will provide me with 
>some idea of the returned error. 
>
>
>Thanks
>
>Ralph
>
>
>
>On Sep 20, 2015, at 12:40 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>
>Yes, it is definitely at 10. 
>
>Another attempt is attached.
>
>-Paul
>
>
>On Sun, Sep 20, 2015 at 8:19 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>Paul - can you please confirm that you gave mpirun a level of 10 for the 
>pmix_base_verbose param? This output isn’t what I would have expected from 
>that level - it looks more like the verbosity was set to 5, and so the error 
>number isn’t printed. 
>
>
>Thanks
>
>Ralph
>
>
>
>On Sep 20, 2015, at 3:42 AM, Gilles Gouaillardet 
><gilles.gouaillar...@gmail.com> wrote:
>
>
>Paul, 
>
>
>I do not remember it like that ...
>
>
>at that time, the issue in ompi was that the global errno was used instead of 
>the per-thread errno.
>
>though the man page says -mt should be used for multithreaded apps, you 
>tried -D_REENTRANT on all your platforms, and it was enough to get the 
>expected result.
>
>
>I just wanted to check pmix1xx (sub)configure did correctly pass the 
>-D_REENTRANT flag, and it does. so this is very likely a new and unrelated 
>error
>
>
>Cheers,
>
>
>Gilles
>
>
>On Sunday, September 20, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>Gilles, 
>
>
>Yes every $CC invocation in opal/mca/pmix/pmix1xx includes "-D_REENTRANT".
>
>However, they don't include "-mt".
>
>I believe we concluded (when we had problems previously) that "-mt" was the 
>proper flag (at compile and link) for multi-threaded with the Studio compilers.
>
>
>-Paul
>
>
>On Sat, Sep 19, 2015 at 11:29 PM, Gilles Gouaillardet 
><gilles.gouaillar...@gmail.com> wrote:
>
>Paul, 
>
>
>Can you please double check pmix1xx is compiled with -D_REENTRANT ?
>
>We ran into similar issues in the past, and they only occurred with Solaris 
>
>
>Cheers,
>
>
>Gilles 
>
>
>
>On Sunday, September 20, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>Ralph, 
>
>The output from the requested run is attached.
>
>-Paul
>
>
>On Sat, Sep 19, 2015 at 9:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>Ah, okay - that makes more sense. I’ll have to let Brice see if he can figure 
>out how to silence the hwloc error message as I can’t find where it came from. 
>The other errors are real and are the reason why the job was terminated. 
>
>
>The problem is that we are trying to establish a communication between the app 
>and the daemon via unix domain socket, and we failed to do so. The error tells 
>me that we were able to create and connect to the socket, but failed when the 
>daemon tried to do a blocking send to the app.
>
>
>Can you rerun it with -mca pmix_base_verbose 10? It will tell us the value of 
>the error number that was returned
>
>
>Thanks
>
>Ralph
>
>
>
>On Sep 19, 2015, at 9:37 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>
>Ralph, 
>
>
>No it did not run.
>
>The complete output (which I really should have included in the first place) 
>is below.
>
>
>-Paul
>
>
>$ mpirun -mca btl sm,self -np 2 examples/ring_c
>Error opening /devices/pci@0,0:reg: Permission denied
>[pcp-d-3:26054] PMIX ERROR: ERROR in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c at line 181
>[pcp-d-3:26053] PMIX ERROR: UNREACHABLE in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c at line 463
>--------------------------------------------------------------------------
>It looks like MPI_INIT failed for some reason; your parallel process is
>likely to abort.  There are many reasons that a parallel process can
>fail during MPI_INIT; some of which are due to configuration or environment
>problems.  This failure appears to be an internal failure; here's some
>additional information (which may only be relevant to an Open MPI
>developer):
>
>  ompi_mpi_init: ompi_rte_init failed
>  --> Returned "(null)" (-43) instead of "Success" (0)
>--------------------------------------------------------------------------
>*** An error occurred in MPI_Init
>*** on a NULL communicator
>*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>***    and potentially your MPI job)
>[pcp-d-3:26054] Local abort before MPI_INIT completed completed successfully, 
>but am not able to aggregate error messages, and not able to guarantee that 
>all other processes were killed!
>-------------------------------------------------------
>Primary job  terminated normally, but 1 process returned
>a non-zero exit code.. Per user-direction, the job has been aborted.
>-------------------------------------------------------
>--------------------------------------------------------------------------
>mpirun detected that one or more processes exited with non-zero status, thus 
>causing
>the job to be terminated. The first process to do so was:
>
>  Process name: [[11371,1],0]
>  Exit code:    1
>--------------------------------------------------------------------------
>
>
>On Sat, Sep 19, 2015 at 8:50 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>Paul, can you clarify something for me? The error in this case indicates that 
>the client wasn’t able to reach the daemon - this should have resulted in 
>termination of the job. Did the job actually run? 
>
>
>
>On Sep 18, 2015, at 2:50 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>
>I'm on travel right now, but it should be an easy fix when I return. Sorry for 
>the annoyance 
>
>
>
>On Thu, Sep 17, 2015 at 11:13 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
>Any suggestion how I (as a non-root user) can avoid seeing this hwloc error 
>message on every run? 
>
>
>-Paul
>
>
>On Thu, Sep 17, 2015 at 11:00 PM, Gilles Gouaillardet <gil...@rist.or.jp> 
>wrote:
>
>Paul,
>
>IIRC, the "Permission denied" is coming from hwloc that cannot collect all the 
>info it would like.
>
>Cheers,
>
>Gilles 
>
>
>On 9/18/2015 2:34 PM, Paul Hargrove wrote:
>
>Tried tonight's master tarball on Solaris 11.2 on x86-64 with the Studio 
>Compilers (default ILP32 output) and saw the following result:
>
>$ mpirun -mca btl sm,self -np 2 examples/ring_c
>Error opening /devices/pci@0,0:reg: Permission denied
>[pcp-d-4:00492] PMIX ERROR: ERROR in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c at line 181
>[pcp-d-4:00491] PMIX ERROR: UNREACHABLE in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c at line 463
>
>
>I don't know if the Permission denied error is related to the subsequent PMIX 
>errors, but any message that says "UNREACHABLE" is clearly worth reporting.
>
>
>-Paul
>
>
>-- 
>
>Paul H. Hargrove                          phhargr...@lbl.gov
>
>Computer Languages & Systems Software (CLaSS) Group
>
>Computer Science Department               Tel: +1-510-495-2352
>
>Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>
>
>
>_______________________________________________ devel mailing list 
>de...@open-mpi.org Subscription: 
>http://www.open-mpi.org/mailman/listinfo.cgi/devel Link to this post: 
>http://www.open-mpi.org/community/lists/devel/2015/09/18074.php 
