In the summer, I tested this BTL along with the MTL and was able to use
both of them interchangeably with no problem. I don't know what changed.
libpsm2?
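
One thing that might be worth checking is which PSM2/libfabric stack each
build picks up. A rough sketch, assuming an RPM-based system and the
standard libfabric fi_info utility (adjust for your packaging):

  $ rpm -q libpsm2 libfabric     # installed library versions
  $ fi_info -p psm2 | head       # is the psm2 provider visible to libfabric?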


Arm


On Thu, Sep 20, 2018, 7:06 PM Ralph H Castain <r...@open-mpi.org> wrote:

> We have too many discussion threads overlapping on the same email chain -
> so let’s break the discussion on the OFI problem into its own chain.
>
> We have been investigating this locally and found there are a number of
> conflicts between the MTLs and the OFI/BTL stepping on each other. The
> correct solution is to move endpoint creation/reporting into the
> opal/mca/common area, but that is going to take some work and will likely
> impact release schedules.
>
> Accordingly, we propose to remove the OFI/BTL component from v4.0.0, fix
> the problem in master, and then consider bringing it back as a package to
> v4.1 or v4.2.
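>
> (For anyone who needs to drop the component from their own builds in the
> meantime, a minimal sketch using configure's component-exclusion switch:
>
>   ./configure --enable-mca-no-build=btl-ofi ...
>
> or, at run time, "mpirun --mca btl ^ofi ...".)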
>
> Comments? If we agree, I’ll file a PR to remove it.
> Ralph
>
>
> Begin forwarded message:
>
> *From: *Peter Kjellström <c...@nsc.liu.se>
> *Subject: **Re: [OMPI devel] Announcing Open MPI v4.0.0rc1*
> *Date: *September 20, 2018 at 5:18:35 AM PDT
> *To: *"Gabriel, Edgar" <egabr...@central.uh.edu>
> *Cc: *Open MPI Developers <devel@lists.open-mpi.org>
> *Reply-To: *Open MPI Developers <devel@lists.open-mpi.org>
>
> On Wed, 19 Sep 2018 16:24:53 +0000
> "Gabriel, Edgar" <egabr...@central.uh.edu> wrote:
>
> I performed some tests on our Omnipath cluster, and I have a mixed
> bag of results with 4.0.0rc1
>
>
> I've also tried it on our OPA cluster (skylake+centos-7+inbox) with
> very similar results.
>
> compute-1-1.local.4351PSM2 has not been initialized
> compute-1-0.local.3826PSM2 has not been initialized
>
>
> Yup, I too see these.
>
> mpirun detected that one or more processes exited with non-zero
> status, thus causing the job to be terminated. The first process to
> do so was:
>
>              Process name: [[38418,1],1]
>              Exit code:    255
>
>              
> ----------------------------------------------------------------------------
>
>
> yup.
>
>
> 2.       The ofi mtl does not work at all on our Omnipath cluster. If
> I try to force it using ‘mpirun --mca mtl ofi …’ I get the following
> error message.
>
>
> Yes ofi seems broken. But not even disabling it helps me completely (I
> see "mca_btl_ofi.so           [.] mca_btl_ofi_component_progress" in my
> perf top...)
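>
> Presumably that is because there are now both an ofi MTL and an ofi BTL,
> so excluding only the MTL leaves the BTL loaded. A sketch of excluding
> both (./my_app is just a placeholder):
>
>   $ mpirun --mca mtl ^ofi --mca btl ^ofi,openib ./my_app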
>
> 3.       The openib btl component is always getting in the way with
> annoying warnings. It is not really used, but constantly complains:
>
> ...
>
> [sabine.cacds.uh.edu:25996] 1 more process has sent help message
> help-mpi-btl-openib.txt / ib port not selected
>
>
> Yup.
>
> ...
>
> So bottom line, if I do
>
> mpirun --mca btl ^openib --mca mtl ^ofi …
>
> my tests finish correctly, although mpirun will still return an error.
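>
> A quick way to see that mismatch is to check mpirun's exit status right
> after an otherwise clean run (./my_test is just a placeholder):
>
>   $ mpirun --mca btl ^openib --mca mtl ^ofi ./my_test ; echo "exit=$?"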
>
>
> I get some things to work with this approach (two ranks on two nodes
> for example). But a lot of things crash rather hard:
>
> $ mpirun -mca btl ^openib -mca mtl
> ^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
> --------------------------------------------------------------------------
> PSM2 was unable to open an endpoint. Please make sure that the network
> link is active on the node and the hardware is functioning.
>
>  Error: Failure in initializing endpoint
> --------------------------------------------------------------------------
> n909.279895hfi_userinit: assign_context command failed: Device or
> resource busy n909.279895psmi_context_open: hfi_userinit: failed,
> trying again (1/3)
> ...
>  PML add procs failed
>  --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
> [n908:298761] *** An error occurred in MPI_Init
> [n908:298761] *** reported by process [4092002305,59]
> [n908:298761] *** on a NULL communicator
> [n908:298761] *** Unknown error
> [n908:298761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>  will now abort, [n908:298761] ***    and potentially your MPI job)
> [n907:407748] 255 more processes have sent help message
>  help-mtl-psm2.txt / unable to open endpoint [n907:407748] Set MCA
>  parameter "orte_base_help_aggregate" to 0 to see all help / error
>  messages [n907:407748] 127 more processes have sent help message
>  help-mpi-runtime.txt / mpi_init:startup:internal-failure
>  [n907:407748] 56 more processes have sent help message
>  help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>
> If I disable psm2 too I get it to run (apparently on vader?)
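>
> For reference, the full exclusion I mean is something like the following
> (treat the exact component lists as an assumption; btl_base_verbose just
> reports which BTL actually gets picked):
>
>   $ mpirun --mca btl ^openib --mca mtl ^ofi,psm2 \
>       --mca btl_base_verbose 100 ./imb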
>
> /Peter K
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
