On Wed, 19 Sep 2018 16:24:53 +0000 "Gabriel, Edgar" <egabr...@central.uh.edu> wrote:
> I performed some tests on our Omnipath cluster, and I have a mixed
> bag of results with 4.0.0rc1

I've also tried it on our OPA cluster (skylake+centos-7+inbox) with
very similar results.

> compute-1-1.local.4351PSM2 has not been initialized
> compute-1-0.local.3826PSM2 has not been initialized

Yup, I too see these.

> mpirun detected that one or more processes exited with non-zero
> status, thus causing the job to be terminated. The first process to
> do so was:
>
>   Process name: [[38418,1],1]
>   Exit code:    255
> ----------------------------------------------------------------------------

Yup.

> 2. The ofi mtl does not work at all on our Omnipath cluster. If
> I try to force it using 'mpirun -mca mtl ofi ...' I get the
> following error message.

Yes, ofi seems broken. But not even disabling it helps me completely
(I see "mca_btl_ofi.so [.] mca_btl_ofi_component_progress" in my perf
top). Presumably that is the new ofi btl, which "-mca mtl ^ofi" does
not touch; it would need "-mca btl ^ofi" as well.

> 3. The openib btl component is always getting in the way with
> annoying warnings. It is not really used, but constantly complains:
...
> [sabine.cacds.uh.edu:25996] 1 more process has sent help message
> help-mpi-btl-openib.txt / ib port not selected

Yup.

...

> So bottom line, if I do
>
> mpirun -mca btl ^openib -mca mtl ^ofi ...
>
> my tests finish correctly, although mpirun will still return an
> error.

I get some things to work with this approach (two ranks on two nodes,
for example). But a lot of things crash rather hard:

$ mpirun -mca btl ^openib -mca mtl ^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
--------------------------------------------------------------------------
PSM2 was unable to open an endpoint.
Please make sure that the network link is active on the node and
the hardware is functioning.

  Error: Failure in initializing endpoint
--------------------------------------------------------------------------
n909.279895hfi_userinit: assign_context command failed: Device or resource busy
n909.279895psmi_context_open: hfi_userinit: failed, trying again (1/3)
...
  PML add procs failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[n908:298761] *** An error occurred in MPI_Init
[n908:298761] *** reported by process [4092002305,59]
[n908:298761] *** on a NULL communicator
[n908:298761] *** Unknown error
[n908:298761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n908:298761] ***    and potentially your MPI job)
[n907:407748] 255 more processes have sent help message help-mtl-psm2.txt / unable to open endpoint
[n907:407748] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[n907:407748] 127 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[n907:407748] 56 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

If I disable psm2 too I get it to run (apparently on vader?); a
sketch of such a command is in the P.S. below.

/Peter K
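
P.S. In case it helps with reproducing the "psm2 disabled" run: a
sketch of the kind of command line I mean (the exact exclusion lists
are from memory, not a verified recipe) is

$ mpirun -mca btl ^openib -mca mtl ^ofi,psm2 \
    ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1

With no usable mtl left, the cm pml should be skipped and ob1 should
take over with the vader (shared memory) and tcp btls, which would
match the "apparently on vader" behaviour for ranks on the same node.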
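
And rather than repeating the exclusions on every mpirun, the same
parameters can be made persistent, e.g. in
$HOME/.openmpi/mca-params.conf (a sketch assuming the ^openib/^ofi
workaround above):

btl = ^openib,ofi
mtl = ^ofi

Environment variables (OMPI_MCA_btl, OMPI_MCA_mtl) work as well,
which is handy inside batch scripts.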