On Wed, 19 Sep 2018 16:24:53 +0000
"Gabriel, Edgar" <egabr...@central.uh.edu> wrote:

> I performed some tests on our Omnipath cluster, and I have a mixed
> bag of results with 4.0.0rc1

I've also tried it on our OPA cluster (skylake+centos-7+inbox) with
very similar results.

> compute-1-1.local.4351PSM2 has not been initialized
> compute-1-0.local.3826PSM2 has not been initialized

Yup, I see these too.
 
> mpirun detected that one or more processes exited with non-zero
> status, thus causing the job to be terminated. The first process to
> do so was:
> 
>               Process name: [[38418,1],1]
>               Exit code:    255
>               
> ----------------------------------------------------------------------------

yup.
 
> 
> 2.       The ofi mtl does not work at all on our Omnipath cluster. If
> I try to force it using ‘mpirun –mca mtl ofi …’ I get the following
> error message.

Yes, ofi seems broken. But even disabling it does not help me completely: I
still see "mca_btl_ofi.so           [.] mca_btl_ofi_component_progress" in my
perf top.

> 3.       The openib btl component is always getting in the way with
> annoying warnings. It is not really used, but constantly complains:
...
> [sabine.cacds.uh.edu:25996] 1 more process has sent help message
> help-mpi-btl-openib.txt / ib port not selected

Yup.

...
> So bottom line, if I do
> 
> mpirun –mca btl^openib –mca mtl^ofi ….
> 
> my tests finish correctly, although mpirun will still return an error.

I get some things to work with this approach (two ranks on two nodes,
for example). But a lot of things crash rather hard:

 $ mpirun -mca btl ^openib -mca mtl ^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
--------------------------------------------------------------------------
PSM2 was unable to open an endpoint. Please make sure that the network
link is active on the node and the hardware is functioning.

  Error: Failure in initializing endpoint
--------------------------------------------------------------------------
n909.279895hfi_userinit: assign_context command failed: Device or resource busy
n909.279895psmi_context_open: hfi_userinit: failed, trying again (1/3)
...
  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n908:298761] *** An error occurred in MPI_Init
[n908:298761] *** reported by process [4092002305,59]
[n908:298761] *** on a NULL communicator
[n908:298761] *** Unknown error
[n908:298761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n908:298761] ***    and potentially your MPI job)
[n907:407748] 255 more processes have sent help message help-mtl-psm2.txt / unable to open endpoint
[n907:407748] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[n907:407748] 127 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[n907:407748] 56 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

If I disable psm2 too, I get it to run (apparently on vader?).

/Peter K
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel