Paul,

The latest master nightly snapshot includes the fix, and I made PRs for v2.x and v1.10.

Cheers,

Gilles

On 9/28/2015 6:29 PM, Gilles Gouaillardet wrote:
Thanks Brice,

I will open the PRs for the various ompi branches starting tomorrow.

Cheers,

Gilles

Brice Goglin <brice.gog...@inria.fr> wrote:
Sorry, I didn't see this report before the pull request.

I applied Gilles' "simple but arguable" fix to master and stable branches up to v1.9. It could prove inadequate if somebody ever changes the permissions of /devices/pci*, but I guess that's not going to happen in practice. Finding the right device path and checking permissions inside hwloc looks more arguable to me.
Thanks!

I am adding a filter to my email client to avoid missing hwloc-related things among OMPI mails.

Brice




On 28/09/2015 06:23, Gilles Gouaillardet wrote:
Paul and Brice,

the error message is displayed by libpciaccess when hwloc invokes pci_system_init

on Solaris :
crw------- 1 root sys 182, 253 Sep 28 10:55 /devices/pci@0,0:reg

from libpciaccess

    snprintf(nexus_path, sizeof(nexus_path), "/devices%s", nexus_name);
    if ((fd = open(nexus_path, O_RDWR | O_CLOEXEC)) >= 0) {
[...]
    } else {
        (void) fprintf(stderr, "Error opening %s: %s\n",
                       nexus_path, strerror(errno));
[...]
    }

I noted some TODO comments in the code to handle this.
Since this piece of code is deep inside libpciaccess, I guess a fix is not trivial. Unless libpciaccess is modified (for example, to skip the fprintf when a given environment variable is set), hwloc would have to "emulate" pieces of libpciaccess to get the device path, check the permissions, and invoke pci_system_init only if everything is ok.


Another, simpler (but arguable ...) option is not to probe the PCI bus on Solaris unless running as root. I made PR #136 (https://github.com/open-mpi/hwloc/pull/136) to implement this.
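
For illustration only, here is a minimal sketch of the two approaches discussed above. It is not the code from PR #136 or from hwloc; the function names pci_probe_allowed_root_only and pci_node_accessible are hypothetical, and the device path is expected to be supplied by the caller.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper (not hwloc code): the "simple but arguable" policy,
 * i.e. only attempt the PCI probe when running with root privileges. */
int pci_probe_allowed_root_only(void)
{
    return geteuid() == 0;
}

/* Hypothetical helper (not hwloc code): the "emulate libpciaccess" policy.
 * Check that the nexus device node can be opened read/write, mirroring the
 * O_RDWR open() that libpciaccess performs, before calling pci_system_init().
 * Note that access() checks against the real user ID, not the effective one. */
int pci_node_accessible(const char *nexus_path)
{
    assert(nexus_path != NULL);
    return access(nexus_path, R_OK | W_OK) == 0;
}
```

A caller would invoke pci_system_init() only when the chosen helper returns non-zero, so libpciaccess never reaches the fprintf on the failure path. The second variant still requires knowing the right path under /devices, which is the non-trivial part mentioned above.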

Cheers,

Gilles

On 9/26/2015 9:24 AM, Paul Hargrove wrote:
FYI:

Things look fine today with last night's master tarball.

I hope Brice has a way to eliminate the hwloc warning, since I am sure I am not the only one with scripts that will notice "Error" in the output.

-Paul

On Wed, Sep 23, 2015 at 6:08 PM, Ralph Castain <r...@open-mpi.org> wrote:

    Aha! Thanks - just what the doctor ordered!


    On Sep 23, 2015, at 5:45 PM, Gilles Gouaillardet
    <gil...@rist.or.jp> wrote:

    Ralph,

    the root cause is
    getsockopt(..., SOL_SOCKET, SO_RCVTIMEO,...)
    fails with errno ENOPROTOOPT on solaris 11.2

    the attached patch is a proof of concept and works for me :
    /* if ENOPROTOOPT, do not try to set and restore SO_RCVTIMEO */

    Cheers,

    Gilles

    On 9/21/2015 2:16 PM, Paul Hargrove wrote:
    Ralph,

    Just as you say:
    The first 64s pause was before the hwloc error message appeared.
    The second was after the second server_setup_fork appears, and
    before whatever line came after that.

    I don't know if stdio buffering may be "distorting" the
    placement of the pause relative to the lines of output.
    However, prior to your patch the entire failed mpirun was
    around 1s.

    No allocation.
    No resource manager.
    Just a single workstation.

    -Paul

    On Sun, Sep 20, 2015 at 9:32 PM, Ralph Castain
    <r...@open-mpi.org> wrote:

        ?? Just so this old fossilized brain gets this right: you
        are saying there was a 64s pause before the hwloc error
        appeared, and then another 64s pause after the second
        server_setup_fork message appeared?

        If that’s true, then I’m chasing the wrong problem - it
        sounds like something is messed up in the mpirun startup.
        Did you have more than one node in the allocation by
        chance? I’m wondering if we are getting held up by
        something in the daemon launch/callback area.



        On Sep 20, 2015, at 4:08 PM, Paul Hargrove
        <phhargr...@lbl.gov> wrote:

        Ralph,

        Still failing with that patch, but with the addition of a
        fairly long pause (64s) before the first error message
        appears, and again after the second "server setup_fork"
        (64s again)

        New output is attached.

        -Paul

        On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain
        <r...@open-mpi.org> wrote:

            Argh - found a typo in the output line. Could you
            please try the attached patch and do it again? This
            might fix it, but if not it will provide me with some
            idea of the returned error.

            Thanks
            Ralph


            On Sep 20, 2015, at 12:40 PM, Paul Hargrove
            <phhargr...@lbl.gov> wrote:

            Yes, it is definitely at 10.
            Another attempt is attached.
            -Paul

            On Sun, Sep 20, 2015 at 8:19 AM, Ralph Castain
            <r...@open-mpi.org> wrote:

                Paul - can you please confirm that you gave
                mpirun a level of 10 for the pmix_base_verbose
                param? This output isn’t what I would have
                expected from that level - it looks more like
                the verbosity was set to 5, and so the error
                number isn’t printed.

                Thanks
                Ralph


                On Sep 20, 2015, at 3:42 AM, Gilles
                Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

                Paul,

                I do not remember it like that ...

                at that time, the issue in ompi was that the
                global errno was used instead of the per-thread
                errno.
                though the man page says -mt should be used
                for multithreaded apps, you tried -D_REENTRANT
                on all your platforms, and it was enough to get
                the expected result.

                I just wanted to check pmix1xx (sub)configure
                did correctly pass the -D_REENTRANT flag, and
                it does. so this is very likely a new and
                unrelated error

                Cheers,

                Gilles

                On Sunday, September 20, 2015, Paul Hargrove
                <phhargr...@lbl.gov> wrote:

                    Gilles,

                    Yes every $CC invocation
                    in opal/mca/pmix/pmix1xx includes
                    "-D_REENTRANT".
                    However, they don't include "-mt".
                    I believe we concluded (when we had
                    problems previously) that "-mt" was the
                    proper flag (at compile and link) for
                    multi-threaded with the Studio compilers.

                    -Paul

                    On Sat, Sep 19, 2015 at 11:29 PM, Gilles
                    Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

                        Paul,

                        Can you please double check pmix1xx is
                        compiled with -D_REENTRANT ?
                        We ran into similar issues in the past,
                        and they only occurred with Solaris

                        Cheers,

                        Gilles


                        On Sunday, September 20, 2015, Paul
                        Hargrove <phhargr...@lbl.gov> wrote:

                            Ralph,
                            The output from the requested run
                            is attached.
                            -Paul

                            On Sat, Sep 19, 2015 at 9:46 PM,
                            Ralph Castain <r...@open-mpi.org> wrote:

                                Ah, okay - that makes more
                                sense. I’ll have to let Brice
                                see if he can figure out how to
                                silence the hwloc error message
                                as I can’t find where it came
                                from. The other errors are real
                                and are the reason why the job
                                was terminated.

                                The problem is that we are
                                trying to establish a
                                communication between the app
                                and the daemon via unix domain
                                socket, and we failed to do so.
                                The error tells me that we were
                                able to create and connect to
                                the socket, but failed when the
                                daemon tried to do a blocking
                                send to the app.

                                Can you rerun it with -mca
                                pmix_base_verbose 10? It will
                                tell us the value of the error
                                number that was returned

                                Thanks
                                Ralph


                                On Sep 19, 2015, at 9:37 PM,
                                Paul Hargrove
                                <phhargr...@lbl.gov> wrote:

                                Ralph,

                                No it did not run.
                                The complete output (which I
                                really should have included in
                                the first place) is below.

                                -Paul

                                $ mpirun -mca btl sm,self -np 2 examples/ring_c'
                                Error opening /devices/pci@0,0:reg: Permission denied
                                [pcp-d-3:26054] PMIX ERROR: ERROR in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c at line 181
                                [pcp-d-3:26053] PMIX ERROR: UNREACHABLE in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c at line 463
                                --------------------------------------------------------------------------
                                It looks like MPI_INIT failed for some reason; your parallel process is
                                likely to abort. There are many reasons that a parallel process can
                                fail during MPI_INIT; some of which are due to configuration or environment
                                problems. This failure appears to be an internal failure; here's some
                                additional information (which may only be relevant to an Open MPI
                                developer):

                                ompi_mpi_init: ompi_rte_init failed
                                --> Returned "(null)" (-43) instead of "Success" (0)
                                --------------------------------------------------------------------------
                                *** An error occurred in MPI_Init
                                *** on a NULL communicator
                                *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
                                ***    and potentially your MPI job)
                                [pcp-d-3:26054] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
                                -------------------------------------------------------
                                Primary job  terminated normally, but 1 process returned
                                a non-zero exit code.. Per user-direction, the job has been aborted.
                                -------------------------------------------------------
                                --------------------------------------------------------------------------
                                mpirun detected that one or more processes exited with non-zero status, thus causing
                                the job to be terminated. The first process to do so was:

                                Process name: [[11371,1],0]
                                Exit code:    1
                                --------------------------------------------------------------------------

                                On Sat, Sep 19, 2015 at 8:50 PM,
                                Ralph Castain <r...@open-mpi.org> wrote:

                                    Paul, can you clarify
                                    something for me? The
                                    error in this case
                                    indicates that the client
                                    wasn’t able to reach the
                                    daemon - this should have
                                    resulted in termination of
                                    the job. Did the job
                                    actually run?


                                    On Sep 18, 2015, at 2:50
                                    AM, Ralph Castain
                                    <r...@open-mpi.org> wrote:

                                    I'm on travel right now,
                                    but it should be an easy
                                    fix when I return. Sorry
                                    for the annoyance


                                    On Thu, Sep 17, 2015 at 11:13 PM,
                                    Paul Hargrove <phhargr...@lbl.gov> wrote:

                                        Any suggestion how I
                                        (as a non-root user)
                                        can avoid seeing this
                                        hwloc error message
                                        on every run?

                                        -Paul

                                        On Thu, Sep 17, 2015 at 11:00 PM,
                                        Gilles Gouaillardet <gil...@rist.or.jp> wrote:

                                            Paul,

                                            IIRC, the
                                            "Permission
                                            denied" is coming
                                            from hwloc that
                                            cannot collect
                                            all the info it
                                            would like.

                                            Cheers,

                                            Gilles

                                            On 9/18/2015 2:34
                                            PM, Paul Hargrove
                                            wrote:
                                            Tried tonight's
                                            master tarball
                                            on Solaris 11.2
                                            on x86-64 with
                                            the Studio
                                            Compilers
                                             (default ILP32
                                            output) and saw
                                            the following
                                            result

                                            $ mpirun -mca btl sm,self -np 2 examples/ring_c'
                                            Error opening /devices/pci@0,0:reg: Permission denied
                                            [pcp-d-4:00492] PMIX ERROR: ERROR in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c at line 181
                                            [pcp-d-4:00491] PMIX ERROR: UNREACHABLE in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c at line 463

                                            I don't know if
                                            the Permission
                                            denied error is
                                            related to the
                                            subsequent PMIX
                                            errors, but any
                                            message that
                                            says
                                            "UNREACHABLE" is
                                            clearly worth
                                            reporting.

                                            -Paul

                                            --
                                            Paul H. Hargrove                          phhargr...@lbl.gov
                                            Computer Languages & Systems Software (CLaSS) Group
                                            Computer Science Department               Tel: +1-510-495-2352
                                            Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900


                                            
                                            _______________________________________________
                                            devel mailing list
                                            de...@open-mpi.org
                                            Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
                                            Link to this post: http://www.open-mpi.org/community/lists/devel/2015/09/18074.php


                                            











