Sorry, I missed the 4.0 on the PR (despite it being the first thing in the title).

  George.


> On Sep 20, 2018, at 22:15, Ralph H Castain <r...@open-mpi.org> wrote:
> 
> That’s why we are leaving it in master - only removing it from the release branch.
> 
> Sent from my iPhone
> 
> On Sep 20, 2018, at 7:02 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
>> Why not simply ompi_ignore it? Removing a component only to bring it back later 
>> would force us to lose all its history. I would rather add an .ompi_ignore and 
>> give power users an opportunity to continue playing with it.
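>> 
>> For reference, the mechanism I have in mind is the stock build-system one, 
>> roughly (paths and the exact opt-in behavior are from memory, so treat this 
>> as a sketch):
>> 
>>     touch opal/mca/btl/ofi/.ompi_ignore            # autogen.pl then skips the component
>>     echo $USER >> opal/mca/btl/ofi/.ompi_unignore  # opt listed developers back in
>>     ./autogen.pl && ./configure ... && make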
>> 
>>   George.
>> 
>> 
>> On Thu, Sep 20, 2018 at 8:04 PM Ralph H Castain <r...@open-mpi.org> wrote:
>> I already suggested the configure option, but it doesn’t solve the problem. 
>> I wouldn’t be terribly surprised to find that Cray also has an undetected 
>> problem given the nature of the issue - just a question of the amount of 
>> testing, variety of environments, etc.
>> 
>> Nobody has to wait for the next major release, though that isn’t so far off 
>> anyway - there has never been an issue with bringing in a new component 
>> during a release series.
>> 
>> Let’s just fix this the right way and bring it into 4.1 or 4.2. We may want 
>> to look at fixing the osc/rdma/ofi bandaid as well while we are at it.
>> 
>> Ralph
>> 
>> 
>>> On Sep 20, 2018, at 4:45 PM, Patinyasakdikul, Thananon <tpati...@vols.utk.edu> wrote:
>>> 
>>> I understand and agree with your point. My initial email was just out of 
>>> curiosity.
>>> 
>>> Howard tested this BTL for Cray over the summer as well, so this seems to 
>>> affect only OPA hardware.
>>> 
>>> I just remembered that over the summer I had to make some changes in libpsm2 
>>> to get this BTL to work on OPA. Maybe that is the problem, since the stock 
>>> libpsm2 won't work.
>>> 
>>> So maybe we can fix this at the configure step: detect the libpsm2 version and 
>>> don't build the BTL if we are not satisfied with it.
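>>> 
>>> A minimal sketch of the kind of probe configure could run (assuming the 
>>> installed psm2.h still exposes the PSM2_VERNO_MAJOR/PSM2_VERNO_MINOR macros; 
>>> the header path is just illustrative):
>>> 
>>>     grep -E '#define +PSM2_VERNO_(MAJOR|MINOR)' /usr/include/psm2.h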
>>> 
>>> Another idea: maybe we don't build this BTL by default. Users with Cray 
>>> hardware could still use it if they want (just rebuild with the BTL enabled); 
>>> we would just need to verify that it still works on Cray. This way, the OFI 
>>> stakeholders would not have to wait until the next major release to get this in.
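>>> 
>>> One way to get "not built by default" with knobs we already have (just a 
>>> thought, not tested against this component): packaged builds could pass the 
>>> standard no-build flag, and Cray users who want the BTL simply configure 
>>> without it:
>>> 
>>>     # default / packaged build: leave the OFI BTL out
>>>     ./configure --enable-mca-no-build=btl-ofi ...
>>>     # power users on Cray: rerun configure without that flag to get btl/ofi back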
>>> 
>>> 
>>> Arm
>>> 
>>> 
>>> On Thu, Sep 20, 2018, 7:18 PM Ralph H Castain <r...@open-mpi.org> wrote:
>>> I suspect it is a question of what you tested and in which scenarios. The 
>>> problem is that it can bite someone, and there isn't a clean/obvious solution 
>>> that doesn't require the user to do something - e.g., knowing that they need 
>>> to disable a BTL. Matias has proposed an MCA-based approach, but I would much 
>>> rather we just fix this correctly. Bandaids have a habit of becoming 
>>> permanently forgotten - until someone pulls on one and things unravel.
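>>> 
>>> (The "something" the user would have to do being, e.g., passing 
>>> "--mca btl ^ofi" on every mpirun invocation, or adding a "btl = ^ofi" line 
>>> to their openmpi-mca-params.conf.)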
>>> 
>>> 
>>>> On Sep 20, 2018, at 4:14 PM, Patinyasakdikul, Thananon <tpati...@vols.utk.edu> wrote:
>>>> 
>>>> Over the summer I tested this BTL along with the MTL and was able to use 
>>>> both of them interchangeably with no problem. I don't know what changed. 
>>>> libpsm2?
>>>> 
>>>> 
>>>> Arm
>>>> 
>>>> 
>>>> On Thu, Sep 20, 2018, 7:06 PM Ralph H Castain <r...@open-mpi.org> wrote:
>>>> We have too many discussion threads overlapping on the same email chain - 
>>>> so let’s break the discussion on the OFI problem into its own chain.
>>>> 
>>>> We have been investigating this locally and found a number of conflicts 
>>>> between the MTLs and the OFI BTL stepping on each other. The correct 
>>>> solution is to move endpoint creation/reporting into the opal/mca/common 
>>>> area, but that is going to take some work and will likely impact release 
>>>> schedules.
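>>>> 
>>>> Roughly, the idea is a shared helper that owns fabric/endpoint setup and 
>>>> hands the endpoint to both the MTLs and the BTL, instead of each component 
>>>> opening its own. A sketch of the layout (names are illustrative only, 
>>>> nothing is committed yet):
>>>> 
>>>>     opal/mca/common/ofi/
>>>>         common_ofi.c    # open the fabric/domain once, create and publish the endpoint
>>>>         common_ofi.h
>>>>     ompi/mca/mtl/ofi/   # consumes the common endpoint
>>>>     opal/mca/btl/ofi/   # likewise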
>>>> 
>>>> Accordingly, we propose to remove the OFI BTL component from v4.0.0, fix 
>>>> the problem in master, and then consider bringing it back, as a complete 
>>>> package, in v4.1 or v4.2.
>>>> 
>>>> Comments? If we agree, I’ll file a PR to remove it.
>>>> Ralph
>>>> 
>>>> 
>>>>> Begin forwarded message:
>>>>> 
>>>>> From: Peter Kjellström <c...@nsc.liu.se>
>>>>> Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1
>>>>> Date: September 20, 2018 at 5:18:35 AM PDT
>>>>> To: "Gabriel, Edgar" <egabr...@central.uh.edu>
>>>>> Cc: Open MPI Developers <devel@lists.open-mpi.org>
>>>>> Reply-To: Open MPI Developers <devel@lists.open-mpi.org>
>>>>> 
>>>>> On Wed, 19 Sep 2018 16:24:53 +0000
>>>>> "Gabriel, Edgar" <egabr...@central.uh.edu> wrote:
>>>>> 
>>>>>> I performed some tests on our Omnipath cluster, and I have a mixed
>>>>>> bag of results with 4.0.0rc1
>>>>> 
>>>>> I've also tried it on our OPA cluster (skylake+centos-7+inbox) with
>>>>> very similar results.
>>>>> 
>>>>>> compute-1-1.local.4351PSM2 has not been initialized
>>>>>> compute-1-0.local.3826PSM2 has not been initialized
>>>>> 
>>>>> yup I too see these.
>>>>> 
>>>>>> mpirun detected that one or more processes exited with non-zero
>>>>>> status, thus causing the job to be terminated. The first process to
>>>>>> do so was:
>>>>>> 
>>>>>>              Process name: [[38418,1],1]
>>>>>>              Exit code:    255
>>>>>>              
>>>>>> ----------------------------------------------------------------------------
>>>>> 
>>>>> yup.
>>>>> 
>>>>>> 
>>>>>> 2.       The ofi mtl does not work at all on our Omnipath cluster. If
>>>>>> I try to force it using ‘mpirun --mca mtl ofi …’ I get the following
>>>>>> error message.
>>>>> 
>>>>> Yes, ofi seems broken. But not even disabling it helps me completely (I
>>>>> see "mca_btl_ofi.so           [.] mca_btl_ofi_component_progress" in my
>>>>> perf top...).
>>>>> 
>>>>>> 3.       The openib btl component is always getting in the way with
>>>>>> annoying warnings. It is not really used, but constantly complains:
>>>>> ...
>>>>>> [sabine.cacds.uh.edu:25996] 1 more process has sent help message
>>>>>> help-mpi-btl-openib.txt / ib port not selected
>>>>> 
>>>>> Yup.
>>>>> 
>>>>> ...
>>>>>> So bottom line, if I do
>>>>>> 
>>>>>> mpirun --mca btl ^openib --mca mtl ^ofi …
>>>>>> 
>>>>>> my tests finish correctly, although mpirun will still return an error.
>>>>> 
>>>>> I get some things to work with this approach (two ranks on two nodes
>>>>> for example). But a lot of things crash rather hard:
>>>>> 
>>>>> $ mpirun -mca btl ^openib -mca mtl
>>>>> ^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
>>>>> --------------------------------------------------------------------------
>>>>> PSM2 was unable to open an endpoint. Please make sure that the network
>>>>> link is active on the node and the hardware is functioning.
>>>>> 
>>>>>  Error: Failure in initializing endpoint
>>>>> --------------------------------------------------------------------------
>>>>> n909.279895hfi_userinit: assign_context command failed: Device or
>>>>> resource busy n909.279895psmi_context_open: hfi_userinit: failed,
>>>>> trying again (1/3)
>>>>> ...
>>>>>  PML add procs failed
>>>>>  --> Returned "Error" (-1) instead of "Success" (0)
>>>>> --------------------------------------------------------------------------
>>>>> [n908:298761] *** An error occurred in MPI_Init
>>>>> [n908:298761] *** reported by process [4092002305,59]
>>>>> [n908:298761] *** on a NULL communicator
>>>>> [n908:298761] *** Unknown error
>>>>> [n908:298761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>>>>>  will now abort, [n908:298761] ***    and potentially your MPI job)
>>>>> [n907:407748] 255 more processes have sent help message
>>>>>  help-mtl-psm2.txt / unable to open endpoint [n907:407748] Set MCA
>>>>>  parameter "orte_base_help_aggregate" to 0 to see all help / error
>>>>>  messages [n907:407748] 127 more processes have sent help message
>>>>>  help-mpi-runtime.txt / mpi_init:startup:internal-failure
>>>>>  [n907:407748] 56 more processes have sent help message
>>>>>  help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>>>>> 
>>>>> If I disable psm2 too I get it to run (apparently on vader?)
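>>>>> 
>>>>> (i.e. something like "mpirun -mca btl ^openib -mca mtl ^ofi,psm2 ..." - at
>>>>> that point only the tcp/vader paths should be left in play)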
>>>>> 
>>>>> /Peter K
