Ralph,

The output from the requested run is attached.

-Paul

On Sat, Sep 19, 2015 at 9:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Ah, okay - that makes more sense. I’ll have to let Brice see if he can
> figure out how to silence the hwloc error message as I can’t find where
> it came from. The other errors are real and are the reason why the job
> was terminated.
>
> The problem is that we are trying to establish a communication between
> the app and the daemon via unix domain socket, and we failed to do so.
> The error tells me that we were able to create and connect to the
> socket, but failed when the daemon tried to do a blocking send to the
> app.
>
> Can you rerun it with -mca pmix_base_verbose 10? It will tell us the
> value of the error number that was returned
>
> Thanks
> Ralph
>
> On Sep 19, 2015, at 9:37 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Ralph,
>
> No it did not run.
> The complete output (which I really should have included in the first
> place) is below.
>
> -Paul
>
> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
> Error opening /devices/pci@0,0:reg: Permission denied
> [pcp-d-3:26054] PMIX ERROR: ERROR in file
> /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
> at line 181
> [pcp-d-3:26053] PMIX ERROR: UNREACHABLE in file
> /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
> at line 463
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "(null)" (-43) instead of "Success" (0)
> --------------------------------------------------------------------------
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***    and potentially your MPI job)
> [pcp-d-3:26054] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able to
> guarantee that all other processes were killed!
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>
>   Process name: [[11371,1],0]
>   Exit code:    1
> --------------------------------------------------------------------------
>
> On Sat, Sep 19, 2015 at 8:50 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Paul, can you clarify something for me? The error in this case indicates
>> that the client wasn’t able to reach the daemon - this should have
>> resulted in termination of the job. Did the job actually run?
>>
>>
>> On Sep 18, 2015, at 2:50 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> I'm on travel right now, but it should be an easy fix when I return.
>> Sorry for the annoyance
>>
>>
>> On Thu, Sep 17, 2015 at 11:13 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>>> Any suggestion how I (as a non-root user) can avoid seeing this hwloc
>>> error message on every run?
>>>
>>> -Paul
>>>
>>> On Thu, Sep 17, 2015 at 11:00 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>
>>>> Paul,
>>>>
>>>> IIRC, the "Permission denied" is coming from hwloc that cannot collect
>>>> all the info it would like.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 9/18/2015 2:34 PM, Paul Hargrove wrote:
>>>>
>>>> Tried tonight's master tarball on Solaris 11.2 on x86-64 with the
>>>> Studio Compilers (default ILP32 output) and saw the following result
>>>>
>>>> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
>>>> Error opening /devices/pci@0,0:reg: Permission denied
>>>> [pcp-d-4:00492] PMIX ERROR: ERROR in file
>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
>>>> at line 181
>>>> [pcp-d-4:00491] PMIX ERROR: UNREACHABLE in file
>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>> at line 463
>>>>
>>>> I don't know if the Permission denied error is related to the
>>>> subsequent PMIX errors, but any message that says "UNREACHABLE" is
>>>> clearly worth reporting.
>>>>
>>>> -Paul
>>>>
>>>> --
>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>> Computer Languages & Systems Software (CLaSS) Group
>>>> Computer Science Department               Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
--
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
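To confirm Gilles' diagnosis of the "Permission denied" message independently of Open MPI, the following is a minimal sketch (a throwaway test, not hwloc or Open MPI code; the file name check_devperm.c is made up here) that simply tries to open the same device node hwloc reads and reports the errno. As an ordinary user on Solaris this is expected to fail with EACCES, which is why the message appears on every run without root privileges.

    /* check_devperm.c - standalone check, unrelated to Open MPI/hwloc internals */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* path taken from the hwloc error message in this thread */
        const char *path = "/devices/pci@0,0:reg";
        int fd = open(path, O_RDONLY);

        if (fd < 0) {
            /* Expect EACCES ("Permission denied") for a non-root user,
             * matching the text hwloc prints at startup. */
            fprintf(stderr, "open(%s) failed: %s (errno=%d)\n",
                    path, strerror(errno), errno);
            return 1;
        }
        printf("open(%s) succeeded\n", path);
        close(fd);
        return 0;
    }

The attached output below is from the rerun Ralph requested, presumably the same ring_c command with -mca pmix_base_verbose 10 added.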
Error opening /devices/pci@0,0:reg: Permission denied
[pcp-d-3:25836] mca: base: components_register: registering framework pmix components
[pcp-d-3:25836] mca: base: components_register: found loaded component pmix1xx
[pcp-d-3:25836] mca: base: components_register: component pmix1xx has no register or open function
[pcp-d-3:25836] mca: base: components_open: opening pmix components
[pcp-d-3:25836] mca: base: components_open: found loaded component pmix1xx
[pcp-d-3:25836] mca: base: components_open: component pmix1xx open function successful
[pcp-d-3:25836] mca:base:select: Auto-selecting pmix components
[pcp-d-3:25836] mca:base:select:( pmix) Querying component [pmix1xx]
[pcp-d-3:25836] mca:base:select:( pmix) Query of component [pmix1xx] set priority to 5
[pcp-d-3:25836] mca:base:select:( pmix) Selected component [pmix1xx]
[pcp-d-3:25836] pmix:server init called
[pcp-d-3:25836] sec: native init
[pcp-d-3:25836] sec: SPC native active
[pcp-d-3:25836] pmix:server constructed uri pmix-server:25836:/tmp/openmpi-sessions-19214@pcp-d-3_0/11586/0/0/pmix-25836
[pcp-d-3:25836] listen_thread: active
[pcp-d-3:25836] pmix:server register client 759300097:0
[pcp-d-3:25836] pmix:server register client 759300097:1
[pcp-d-3:25836] pmix:server _register_client for nspace 759300097 rank 0
[pcp-d-3:25836] pmix:server setup_fork for nspace 759300097 rank 0
[pcp-d-3:25836] pmix:server _register_client for nspace 759300097 rank 1
[pcp-d-3:25836] pmix:server _register_nspace
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.ltopo
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.jobid
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.offset
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.nmap
[pcp-d-3:25836] pmix:extract:nodes: checking list: pcp-d-3
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.pmap
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.nodeid
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.node.size
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.lpeers
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.lcpus
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.lldr
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.univ.size
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.job.size
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.local.size
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.max.size
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.pdata
[pcp-d-3:25836] pmix:server _register_nspace recording pmix.pdata
[pcp-d-3:25836] pmix:server setup_fork for nspace 759300097 rank 1
[pcp-d-3:25839] mca: base: components_register: registering framework pmix components
[pcp-d-3:25839] mca: base: components_register: found loaded component pmix1xx
[pcp-d-3:25839] mca: base: components_register: component pmix1xx has no register or open function
[pcp-d-3:25839] mca: base: components_open: opening pmix components
[pcp-d-3:25839] mca: base: components_open: found loaded component pmix1xx
[pcp-d-3:25839] mca: base: components_open: component pmix1xx open function successful
[pcp-d-3:25839] mca:base:select: Auto-selecting pmix components
[pcp-d-3:25839] mca:base:select:( pmix) Querying component [pmix1xx]
[pcp-d-3:25839] mca:base:select:( pmix) Query of component [pmix1xx] set priority to 100
[pcp-d-3:25839] mca:base:select:( pmix) Selected component [pmix1xx]
[pcp-d-3:25839] PMIx_client init
[pcp-d-3:25839] pmix: init called
[pcp-d-3:25839] posting notification recv on tag 0
[pcp-d-3:25839] sec: native init
[pcp-d-3:25839] sec: SPC native active
[pcp-d-3:25839] usock_peer_try_connect: attempting to connect to server
[pcp-d-3:25839] usock_peer_try_connect: attempting to connect to server on socket 15
[pcp-d-3:25839] pmix: SEND CONNECT ACK
[pcp-d-3:25839] sec: native create_cred
[pcp-d-3:25839] sec: using credential 19214:5513
[pcp-d-3:25839] send blocking of 49 bytes to socket 15
[pcp-d-3:25839] blocking send complete to socket 15
[pcp-d-3:25839] pmix: RECV CONNECT ACK FROM SERVER
[pcp-d-3:25839] PMIX ERROR: ERROR in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c at line 181
[pcp-d-3:25836] listen_thread: new connection: (32, 0)
[pcp-d-3:25839] mca: base: close: component pmix1xx closed
[pcp-d-3:25839] mca: base: close: unloading component pmix1xx
[pcp-d-3:25836] connection_handler: new connection: 32
[pcp-d-3:25836] RECV CONNECT ACK FROM PEER ON SOCKET 32
[pcp-d-3:25836] waiting for blocking recv of 16 bytes
[pcp-d-3:25836] blocking receive complete from remote
[pcp-d-3:25836] waiting for blocking recv of 33 bytes
[pcp-d-3:25836] blocking receive complete from remote
[pcp-d-3:25836] connect-ack recvd from peer 759300097:1
[pcp-d-3:25836] sec: native validate_cred 19214:5513
[pcp-d-3:25836] sec: native credential valid
[pcp-d-3:25836] client credential validated
[pcp-d-3:25836] send blocking of 4 bytes to socket 32
[pcp-d-3:25836] PMIX ERROR: UNREACHABLE in file /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c at line 463
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "(null)" (-43) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[pcp-d-3:25839] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[11586,1],1]
  Exit code:    1
--------------------------------------------------------------------------
[pcp-d-3:25836] pmix:server finalize called
[pcp-d-3:25836] listen_thread: shutdown
[pcp-d-3:25836] sec: native finalize
[pcp-d-3:25836] pmix:server finalize complete
[pcp-d-3:25836] mca: base: close: component pmix1xx closed
[pcp-d-3:25836] mca: base: close: unloading component pmix1xx
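The verbose log above follows the sequence Ralph described: the client connects (socket 15), sends its connect ack and credential, and the server validates the credential; the failure then occurs at "send blocking of 4 bytes to socket 32", immediately followed by the UNREACHABLE error. The interleaving also suggests the client (pid 25839) reports its own error at pmix_client.c line 181 and unloads the component while the server (pid 25836) is still completing the handshake, so the peer may already be gone by the time the server does its 4-byte send. The sketch below is a generic illustration of a blocking send over a Unix domain socket with errno reporting; it is not the actual PMIx usock code, and the helper name send_blocking and the socketpair demo in main() are made up here. It only shows the kind of errno (for example EPIPE or ECONNRESET when the peer has closed its end) that the pmix_base_verbose output is meant to surface.

    /* blocking_send.c - generic sketch, not the PMIx implementation */
    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Send exactly `size` bytes on a connected socket, retrying on EINTR. */
    static int send_blocking(int sd, const void *buf, size_t size)
    {
        const char *ptr = (const char *)buf;
        size_t remain = size;

        while (remain > 0) {
            ssize_t rc = send(sd, ptr, remain, 0);
            if (rc < 0) {
                if (errno == EINTR)
                    continue;        /* interrupted - just retry */
                /* EPIPE or ECONNRESET here means the peer already closed
                 * its end, i.e. the peer is effectively unreachable. */
                fprintf(stderr, "blocking send failed: %s (errno=%d)\n",
                        strerror(errno), errno);
                return -1;
            }
            ptr += rc;
            remain -= (size_t)rc;
        }
        return 0;
    }

    int main(void)
    {
        int sv[2];

        /* Ignore SIGPIPE so writing to a closed peer reports EPIPE
         * instead of killing the process. */
        signal(SIGPIPE, SIG_IGN);

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
            perror("socketpair");
            return 1;
        }
        close(sv[1]);                /* simulate the peer going away */
        return send_blocking(sv[0], "ack", 4) == 0 ? 0 : 1;
    }

A real daemon would likewise arrange for SIGPIPE to be ignored (as the demo does) so that a vanished peer shows up as an errno from send() rather than terminating the process; the errno value printed by the verbose run is what distinguishes that case from, say, a credential or protocol mismatch.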