Ralph,

Still failing with that patch, but with the addition of a fairly long pause (64s) before the first error message appears, and again after the second "server setup_fork" (64s again).
New output is attached.

-Paul

On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Argh - found a typo in the output line. Could you please try the attached
> patch and do it again? This might fix it, but if not it will provide me
> with some idea of the returned error.
>
> Thanks
> Ralph
>
>
> On Sep 20, 2015, at 12:40 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Yes, it is definitely at 10.
> Another attempt is attached.
> -Paul
>
> On Sun, Sep 20, 2015 at 8:19 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Paul - can you please confirm that you gave mpirun a level of 10 for the
>> pmix_base_verbose param? This output isn’t what I would have expected from
>> that level - it looks more like the verbosity was set to 5, and so the
>> error number isn’t printed.
>>
>> Thanks
>> Ralph
>>
>>
>> On Sep 20, 2015, at 3:42 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>
>> Paul,
>>
>> I do not remember it like that ...
>>
>> at that time, the issue in ompi was that the global errno was used
>> instead of the per thread errno.
>> though the man page says -mt should be used for multithreaded apps, you
>> tried -D_REENTRANT on all your platforms, and it was enough to get the
>> expected result.
>>
>> I just wanted to check pmix1xx (sub)configure did correctly pass the
>> -D_REENTRANT flag, and it does. so this is very likely a new and unrelated
>> error
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sunday, September 20, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>>> Gilles,
>>>
>>> Yes every $CC invocation in opal/mca/pmix/pmix1xx includes
>>> "-D_REENTRANT".
>>> However, they don't include "-mt".
>>> I believe we concluded (when we had problems previously) that "-mt" was
>>> the proper flag (at compile and link) for multi-threaded with the Studio
>>> compilers.
>>>
>>> -Paul
>>>
>>> On Sat, Sep 19, 2015 at 11:29 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>>
>>>> Paul,
>>>>
>>>> Can you please double check pmix1xx is compiled with -D_REENTRANT?
>>>> We ran into similar issues in the past, and they only occurred with
>>>> Solaris
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>>
>>>> On Sunday, September 20, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>
>>>>> Ralph,
>>>>> The output from the requested run is attached.
>>>>> -Paul
>>>>>
>>>>> On Sat, Sep 19, 2015 at 9:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>>> Ah, okay - that makes more sense. I’ll have to let Brice see if he
>>>>>> can figure out how to silence the hwloc error message as I can’t find
>>>>>> where it came from. The other errors are real and are the reason why
>>>>>> the job was terminated.
>>>>>>
>>>>>> The problem is that we are trying to establish a communication
>>>>>> between the app and the daemon via unix domain socket, and we failed
>>>>>> to do so. The error tells me that we were able to create and connect
>>>>>> to the socket, but failed when the daemon tried to do a blocking send
>>>>>> to the app.
>>>>>>
>>>>>> Can you rerun it with -mca pmix_base_verbose 10? It will tell us the
>>>>>> value of the error number that was returned
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>> On Sep 19, 2015, at 9:37 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> No it did not run.
>>>>>> The complete output (which I really should have included in the first
>>>>>> place) is below.
>>>>>>
>>>>>> -Paul
>>>>>>
>>>>>> $ mpirun -mca btl sm,self -np 2 examples/ring_c
>>>>>> Error opening /devices/pci@0,0:reg: Permission denied
>>>>>> [pcp-d-3:26054] PMIX ERROR: ERROR in file
>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
>>>>>> at line 181
>>>>>> [pcp-d-3:26053] PMIX ERROR: UNREACHABLE in file
>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>>>> at line 463
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>> likely to abort. There are many reasons that a parallel process can
>>>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>>>> problems. This failure appears to be an internal failure; here's some
>>>>>> additional information (which may only be relevant to an Open MPI
>>>>>> developer):
>>>>>>
>>>>>> ompi_mpi_init: ompi_rte_init failed
>>>>>> --> Returned "(null)" (-43) instead of "Success" (0)
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> *** An error occurred in MPI_Init
>>>>>> *** on a NULL communicator
>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>> *** and potentially your MPI job)
>>>>>> [pcp-d-3:26054] Local abort before MPI_INIT completed completed
>>>>>> successfully, but am not able to aggregate error messages, and not able
>>>>>> to guarantee that all other processes were killed!
>>>>>> -------------------------------------------------------
>>>>>> Primary job terminated normally, but 1 process returned
>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>> -------------------------------------------------------
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>> the job to be terminated. The first process to do so was:
>>>>>>
>>>>>> Process name: [[11371,1],0]
>>>>>> Exit code: 1
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> On Sat, Sep 19, 2015 at 8:50 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>>> Paul, can you clarify something for me? The error in this case
>>>>>>> indicates that the client wasn’t able to reach the daemon - this should
>>>>>>> have resulted in termination of the job. Did the job actually run?
>>>>>>>
>>>>>>>
>>>>>>> On Sep 18, 2015, at 2:50 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>> I'm on travel right now, but it should be an easy fix when I return.
>>>>>>> Sorry for the annoyance
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 17, 2015 at 11:13 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>>>>>>
>>>>>>>> Any suggestion how I (as a non-root user) can avoid seeing this
>>>>>>>> hwloc error message on every run?
>>>>>>>>
>>>>>>>> -Paul
>>>>>>>>
>>>>>>>> On Thu, Sep 17, 2015 at 11:00 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>>>>>>>>
>>>>>>>>> Paul,
>>>>>>>>>
>>>>>>>>> IIRC, the "Permission denied" is coming from hwloc that cannot
>>>>>>>>> collect all the info it would like.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Gilles
>>>>>>>>>
>>>>>>>>> On 9/18/2015 2:34 PM, Paul Hargrove wrote:
>>>>>>>>>
>>>>>>>>> Tried tonight's master tarball on Solaris 11.2 on x86-64 with the
>>>>>>>>> Studio Compilers (default ILP32 output) and saw the following result
>>>>>>>>>
>>>>>>>>> $ mpirun -mca btl sm,self -np 2 examples/ring_c
>>>>>>>>> Error opening /devices/pci@0,0:reg: Permission denied
>>>>>>>>> [pcp-d-4:00492] PMIX ERROR: ERROR in file
>>>>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
>>>>>>>>> at line 181
>>>>>>>>> [pcp-d-4:00491] PMIX ERROR: UNREACHABLE in file
>>>>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>>>>>>> at line 463
>>>>>>>>>
>>>>>>>>> I don't know if the Permission denied error is related to the
>>>>>>>>> subsequent PMIX errors, but any message that says "UNREACHABLE" is
>>>>>>>>> clearly worth reporting.
>>>>>>>>>
>>>>>>>>> -Paul
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Paul H. Hargrove  phhargr...@lbl.gov
>>>>>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>>>>>> Computer Science Department  Tel: +1-510-495-2352
>>>>>>>>> Lawrence Berkeley National Laboratory  Fax: +1-510-486-6900
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> de...@open-mpi.org
>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>> Link to this post:
>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php

--
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department  Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory  Fax: +1-510-486-6900