I understand and agree with your point. My initial email was just out of
curiosity.

Howard tested this BTL for Cray in the summer as well, so this seems to
affect only OPA hardware.

I just remembered that in the summer I had to make some changes in libpsm2
to get this BTL to work on OPA. Maybe that is the problem, as the default
libpsm2 won't work.

So maybe we can fix this at the configure step: detect the libpsm2 version
and don't build the BTL if the requirement isn't satisfied.
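
A rough sketch of what that configure-time check could look like (this is
only an illustration: it assumes libpsm2 installs a pkg-config file, and the
minimum version below is a placeholder, not the real cutoff):

  # Hypothetical guard: only build the ofi BTL against a new-enough libpsm2.
  PSM2_MIN_VERSION=11.2.68   # placeholder; the actual required version TBD
  if pkg-config --atleast-version="$PSM2_MIN_VERSION" libpsm2; then
    build_btl_ofi=yes
  else
    build_btl_ofi=no         # stock/older libpsm2: skip the component
  fi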

Another idea is to not build this BTL by default. Users with Cray hardware
could still use it if they want (just rebuild with the BTL; see the sketch
below); we would just need to verify that it still works on Cray. This way,
OFI stakeholders do not have to wait until the next major release to get
this in.
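
Until the default changes, one interim knob (using the existing
--enable-mca-no-build configure option; the exact component name here is my
assumption) would be for packagers to exclude the component at build time:

  # Build Open MPI without the ofi BTL
  ./configure --enable-mca-no-build=btl-ofi ...

and anyone who does want the BTL simply configures without that option.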


Arm


On Thu, Sep 20, 2018, 7:18 PM Ralph H Castain <r...@open-mpi.org> wrote:

> I suspect it is a question of what you tested and in which scenarios.
> Problem is that it can bite someone and there isn’t a clean/obvious
> solution that doesn’t require the user to do something - e.g., like having
> to know that they need to disable a BTL. Matias has proposed an mca-based
> approach, but I would much rather we just fix this correctly. Bandaids have
> a habit of becoming permanently forgotten - until someone pulls on one and
> things unravel.
>
>
> On Sep 20, 2018, at 4:14 PM, Patinyasakdikul, Thananon <
> tpati...@vols.utk.edu> wrote:
>
> In the summer, I tested this BTL along with the MTL and was able to use
> both of them interchangeably with no problem. I don't know what changed.
> libpsm2?
>
>
> Arm
>
>
> On Thu, Sep 20, 2018, 7:06 PM Ralph H Castain <r...@open-mpi.org> wrote:
>
>> We have too many discussion threads overlapping on the same email chain -
>> so let’s break the discussion on the OFI problem into its own chain.
>>
>> We have been investigating this locally and found there are a number of
>> conflicts between the MTLs and the OFI/BTL stepping on each other. The
>> correct solution is to move endpoint creation/reporting into the
>> opal/mca/common area, but that is going to take some work and will likely
>> impact release schedules.
>>
>> Accordingly, we propose to remove the OFI/BTL component from v4.0.0, fix
>> the problem in master, and then consider bringing it back as a package to
>> v4.1 or v4.2.
>>
>> Comments? If we agree, I’ll file a PR to remove it.
>> Ralph
>>
>>
>> Begin forwarded message:
>>
>> *From: *Peter Kjellström <c...@nsc.liu.se>
>> *Subject: **Re: [OMPI devel] Announcing Open MPI v4.0.0rc1*
>> *Date: *September 20, 2018 at 5:18:35 AM PDT
>> *To: *"Gabriel, Edgar" <egabr...@central.uh.edu>
>> *Cc: *Open MPI Developers <devel@lists.open-mpi.org>
>> *Reply-To: *Open MPI Developers <devel@lists.open-mpi.org>
>>
>> On Wed, 19 Sep 2018 16:24:53 +0000
>> "Gabriel, Edgar" <egabr...@central.uh.edu> wrote:
>>
>> I performed some tests on our Omnipath cluster, and I have a mixed
>> bag of results with 4.0.0rc1
>>
>>
>> I've also tried it on our OPA cluster (skylake+centos-7+inbox) with
>> very similar results.
>>
>> compute-1-1.local.4351PSM2 has not been initialized
>> compute-1-0.local.3826PSM2 has not been initialized
>>
>>
>> yup I too see these.
>>
>> mpirun detected that one or more processes exited with non-zero
>> status, thus causing the job to be terminated. The first process to
>> do so was:
>>
>>              Process name: [[38418,1],1]
>>              Exit code:    255
>>
>>              
>> ----------------------------------------------------------------------------
>>
>>
>> yup.
>>
>>
>> 2.       The ofi mtl does not work at all on our Omnipath cluster. If
>> I try to force it using ‘mpirun --mca mtl ofi …’ I get the following
>> error message.
>>
>>
>> Yes, ofi seems broken. But not even disabling it helps me completely (I
>> see "mca_btl_ofi.so           [.] mca_btl_ofi_component_progress" in my
>> perf top...)
>>
>> 3.       The openib btl component is always getting in the way with
>> annoying warnings. It is not really used, but constantly complains:
>>
>> ...
>>
>> [sabine.cacds.uh.edu:25996] 1 more process has sent help message
>> help-mpi-btl-openib.txt / ib port not selected
>>
>>
>> Yup.
>>
>> ...
>>
>> So bottom line, if I do
>>
>> mpirun --mca btl ^openib --mca mtl ^ofi ….
>>
>> my tests finish correctly, although mpirun will still return an error.
>>
>>
>> I get some things to work with this approach (two ranks on two nodes
>> for example). But a lot of things crash rather hard:
>>
>> $ mpirun -mca btl ^openib -mca mtl
>> ^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
>> --------------------------------------------------------------------------
>> PSM2 was unable to open an endpoint. Please make sure that the network
>> link is active on the node and the hardware is functioning.
>>
>>  Error: Failure in initializing endpoint
>> --------------------------------------------------------------------------
>> n909.279895hfi_userinit: assign_context command failed: Device or
>> resource busy n909.279895psmi_context_open: hfi_userinit: failed,
>> trying again (1/3)
>> ...
>>  PML add procs failed
>>  --> Returned "Error" (-1) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> [n908:298761] *** An error occurred in MPI_Init
>> [n908:298761] *** reported by process [4092002305,59]
>> [n908:298761] *** on a NULL communicator
>> [n908:298761] *** Unknown error
>> [n908:298761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>>  will now abort, [n908:298761] ***    and potentially your MPI job)
>> [n907:407748] 255 more processes have sent help message
>>  help-mtl-psm2.txt / unable to open endpoint [n907:407748] Set MCA
>>  parameter "orte_base_help_aggregate" to 0 to see all help / error
>>  messages [n907:407748] 127 more processes have sent help message
>>  help-mpi-runtime.txt / mpi_init:startup:internal-failure
>>  [n907:407748] 56 more processes have sent help message
>>  help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>>
>> If I disable psm2 too I get it to run (apparently on vader?)
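>>
>> A sketch of what such a command looks like (the exact component list here
>> is assumed, not copied from my terminal):
>>
>>   mpirun -mca btl ^openib -mca mtl ^ofi,psm2 ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1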
>>
>> /Peter K