Ralph,

Still failing with that patch, but now with a fairly long pause (64s)
before the first error message appears, and another (64s again) after the
second "server setup_fork".

New output is attached.

-Paul

On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Argh - found a typo in the output line. Could you please try the attached
> patch and do it again? This might fix it, but if not it will provide me
> with some idea of the returned error.
>
> Thanks
> Ralph
>
>
> On Sep 20, 2015, at 12:40 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Yes, it is definitely at 10.
> Another attempt is attached.
> -Paul
>
> On Sun, Sep 20, 2015 at 8:19 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Paul - can you please confirm that you gave mpirun a level of 10 for the
>> pmix_base_verbose param? This output isn’t what I would have expected from
>> that level - it looks more like the verbosity was set to 5, and so the
>> error number isn’t printed.
>>
>> Thanks
>> Ralph
>>
>>
>> On Sep 20, 2015, at 3:42 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>> Paul,
>>
>> I do not remember it like that ...
>>
>> at that time, the issue in ompi was that the global errno was used
>> instead of the per-thread errno.
>> though the man page says -mt should be used for multithreaded apps, you
>> tried -D_REENTRANT on all your platforms, and it was enough to get the
>> expected result.
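
The global versus per-thread errno distinction described above can be checked with a small sketch like the one below. It is only an illustration, not code from ompi or pmix1xx: the function names are hypothetical, and it assumes a POSIX threads environment. Each thread takes the address of errno; in a thread-aware build the worker's address differs from the caller's, while a build that falls back to a single global errno would give the same address in both.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical helper: runs in the worker thread and records where
 * that thread's errno lives. The address is stored as an integer so
 * it can be compared safely after the thread has exited. */
static void *record_errno_address(void *arg)
{
    *(uintptr_t *)arg = (uintptr_t)&errno;
    return NULL;
}

/* Returns 1 if the worker thread has its own errno, 0 if both threads
 * share one global errno, and -1 if the thread could not be run. */
int errno_is_per_thread(void)
{
    pthread_t tid;
    uintptr_t worker_addr = 0;

    if (pthread_create(&tid, NULL, record_errno_address, &worker_addr) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;

    /* The calling thread is still alive, so a per-thread errno must
     * occupy a different address than the worker's did. */
    return worker_addr != (uintptr_t)&errno;
}
```

Building this once with and once without the threading flags under discussion, and comparing the return value, is one way to see whether a given flag set actually selects the per-thread errno.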
>>
>> I just wanted to check that the pmix1xx (sub)configure correctly passes the
>> -D_REENTRANT flag, and it does, so this is very likely a new and unrelated
>> error.
>>
>> Cheers,
>>
>> Gilles
>>
>> On Sunday, September 20, 2015, Paul Hargrove <phhargr...@lbl.gov> wrote:
>>
>>> Gilles,
>>>
>>> Yes, every $CC invocation in opal/mca/pmix/pmix1xx includes
>>> "-D_REENTRANT".
>>> However, they don't include "-mt".
>>> I believe we concluded (when we had problems previously) that "-mt" was
>>> the proper flag (at compile and link) for multi-threaded code with the
>>> Studio compilers.
>>>
>>> -Paul
>>>
>>> On Sat, Sep 19, 2015 at 11:29 PM, Gilles Gouaillardet <
>>> gilles.gouaillar...@gmail.com> wrote:
>>>
>>>> Paul,
>>>>
>>>> Can you please double-check that pmix1xx is compiled with -D_REENTRANT?
>>>> We ran into similar issues in the past, and they only occurred on
>>>> Solaris.
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>>
>>>> On Sunday, September 20, 2015, Paul Hargrove <phhargr...@lbl.gov>
>>>> wrote:
>>>>
>>>>> Ralph,
>>>>> The output from the requested run is attached.
>>>>> -Paul
>>>>>
>>>>> On Sat, Sep 19, 2015 at 9:46 PM, Ralph Castain <r...@open-mpi.org>
>>>>> wrote:
>>>>>
>>>>>> Ah, okay - that makes more sense. I’ll have to let Brice see if he
>>>>>> can figure out how to silence the hwloc error message as I can’t find
>>>>>> where it came from. The other errors are real and are the reason why
>>>>>> the job was terminated.
>>>>>>
>>>>>> The problem is that we are trying to establish communication
>>>>>> between the app and the daemon via a unix domain socket, and we
>>>>>> failed to do so. The error tells me that we were able to create and
>>>>>> connect to the socket, but failed when the daemon tried to do a
>>>>>> blocking send to the app.
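
The failure mode described above, a blocking send over a connected unix domain socket that returns an error number, can be sketched as follows. This is a minimal illustration under stated assumptions, not the PMIx server or client code: send_all_blocking and unix_socket_roundtrip are hypothetical names, and a socketpair stands in for the daemon/app connection.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send exactly len bytes on a blocking socket.
 * Returns 0 on success, or the errno value of the failing send(). */
int send_all_blocking(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return errno;       /* the error number worth logging */
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical check: push "ping" through a connected AF_UNIX pair.
 * Returns 0 on success, otherwise an errno-style error number. */
int unix_socket_roundtrip(void)
{
    int sv[2];
    char reply[4];
    int rc;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return errno;

    rc = send_all_blocking(sv[0], "ping", 4);
    if (rc == 0) {
        ssize_t n = recv(sv[1], reply, sizeof reply, MSG_WAITALL);
        if (n < 0)
            rc = errno;
        else if (n != 4 || memcmp(reply, "ping", 4) != 0)
            rc = EIO;
    }

    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

Returning the errno value from the failing send() is the piece of information being asked for here; a wrong value at this point would also be consistent with the earlier global-errno problem, though that is only a guess.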
>>>>>>
>>>>>> Can you rerun it with -mca pmix_base_verbose 10? It will tell us the
>>>>>> value of the error number that was returned.
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>> On Sep 19, 2015, at 9:37 PM, Paul Hargrove <phhargr...@lbl.gov>
>>>>>> wrote:
>>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> No it did not run.
>>>>>> The complete output (which I really should have included in the first
>>>>>> place) is below.
>>>>>>
>>>>>> -Paul
>>>>>>
>>>>>> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
>>>>>> Error opening /devices/pci@0,0:reg: Permission denied
>>>>>> [pcp-d-3:26054] PMIX ERROR: ERROR in file
>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
>>>>>> at line 181
>>>>>> [pcp-d-3:26053] PMIX ERROR: UNREACHABLE in file
>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x64-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>>>> at line 463
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>>> fail during MPI_INIT; some of which are due to configuration or environment
>>>>>> problems.  This failure appears to be an internal failure; here's some
>>>>>> additional information (which may only be relevant to an Open MPI
>>>>>> developer):
>>>>>>
>>>>>>   ompi_mpi_init: ompi_rte_init failed
>>>>>>   --> Returned "(null)" (-43) instead of "Success" (0)
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> *** An error occurred in MPI_Init
>>>>>> *** on a NULL communicator
>>>>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>>> ***    and potentially your MPI job)
>>>>>> [pcp-d-3:26054] Local abort before MPI_INIT completed completed
>>>>>> successfully, but am not able to aggregate error messages, and not able
>>>>>> to guarantee that all other processes were killed!
>>>>>> -------------------------------------------------------
>>>>>> Primary job  terminated normally, but 1 process returned
>>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>>> -------------------------------------------------------
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>>> the job to be terminated. The first process to do so was:
>>>>>>
>>>>>>   Process name: [[11371,1],0]
>>>>>>   Exit code:    1
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> On Sat, Sep 19, 2015 at 8:50 PM, Ralph Castain <r...@open-mpi.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Paul, can you clarify something for me? The error in this case
>>>>>>> indicates that the client wasn’t able to reach the daemon - this should
>>>>>>> have resulted in termination of the job. Did the job actually run?
>>>>>>>
>>>>>>>
>>>>>>> On Sep 18, 2015, at 2:50 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>> I'm on travel right now, but it should be an easy fix when I return.
>>>>>>> Sorry for the annoyance
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 17, 2015 at 11:13 PM, Paul Hargrove <phhargr...@lbl.gov>
>>>>>>>  wrote:
>>>>>>>
>>>>>>>> Any suggestion how I (as a non-root user) can avoid seeing this
>>>>>>>> hwloc error message on every run?
>>>>>>>>
>>>>>>>> -Paul
>>>>>>>>
>>>>>>>> On Thu, Sep 17, 2015 at 11:00 PM, Gilles Gouaillardet <
>>>>>>>> gil...@rist.or.jp> wrote:
>>>>>>>>
>>>>>>>>> Paul,
>>>>>>>>>
>>>>>>>>> IIRC, the "Permission denied" is coming from hwloc that cannot
>>>>>>>>> collect all the info it would like.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Gilles
>>>>>>>>>
>>>>>>>>> On 9/18/2015 2:34 PM, Paul Hargrove wrote:
>>>>>>>>>
>>>>>>>>> Tried tonight's master tarball on Solaris 11.2 on x86-64 with the
>>>>>>>>> Studio compilers (default ILP32 output) and saw the following result:
>>>>>>>>>
>>>>>>>>> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
>>>>>>>>> Error opening /devices/pci@0,0:reg: Permission denied
>>>>>>>>> [pcp-d-4:00492] PMIX ERROR: ERROR in file
>>>>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/client/pmix_client.c
>>>>>>>>> at line 181
>>>>>>>>> [pcp-d-4:00491] PMIX ERROR: UNREACHABLE in file
>>>>>>>>> /export/home/phargrov/OMPI/openmpi-master-solaris11-x86-ss12u3/openmpi-dev-2559-g567c9e3/opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>>>>>>> at line 463
>>>>>>>>>
>>>>>>>>> I don't know if the Permission denied error is related to the
>>>>>>>>> subsequent PMIX errors, but any message that says "UNREACHABLE" is
>>>>>>>>> clearly worth reporting.
>>>>>>>>>
>>>>>>>>> -Paul
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Paul H. Hargrove                          phhargr...@lbl.gov
>>>>>>>>> Computer Languages & Systems Software (CLaSS) Group
>>>>>>>>> Computer Science Department               Tel: +1-510-495-2352
>>>>>>>>> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> de...@open-mpi.org
>>>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>> Link to this post: 
>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
>>>>>>>>>



-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department               Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900

Attachment: typescript
Description: Binary data
