Sorry, I missed the 4.0 on the PR (despite it being the first thing in the title).
George.

> On Sep 20, 2018, at 22:15, Ralph H Castain <r...@open-mpi.org> wrote:
>
> That’s why we are leaving it in master - only removing it from release branch
>
> Sent from my iPhone
>
> On Sep 20, 2018, at 7:02 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> Why not simply ompi_ignore it? Removing a component to bring it back later
>> would force us to lose all history. I would rather add an .ompi_ignore and
>> give power users an opportunity to continue playing with it.
>>
>> George.
>>
>>
>> On Thu, Sep 20, 2018 at 8:04 PM Ralph H Castain <r...@open-mpi.org> wrote:
>> I already suggested the configure option, but it doesn’t solve the problem.
>> I wouldn’t be terribly surprised to find that Cray also has an undetected
>> problem given the nature of the issue - just a question of the amount of
>> testing, variety of environments, etc.
>>
>> Nobody has to wait for the next major release, though that isn’t so far off
>> anyway - there has never been an issue with bringing in a new component
>> during a release series.
>>
>> Let’s just fix this the right way and bring it into 4.1 or 4.2. We may want
>> to look at fixing the osc/rdma/ofi bandaid as well while we are at it.
>>
>> Ralph
>>
>>
>>> On Sep 20, 2018, at 4:45 PM, Patinyasakdikul, Thananon <tpati...@vols.utk.edu> wrote:
>>>
>>> I understand and agree with your point. My initial email is just out of
>>> curiosity.
>>>
>>> Howard tested this BTL for Cray in the summer as well. So this seems to
>>> only affect OPA hardware.
>>>
>>> I just remember that in the summer, I had to make some changes in libpsm2
>>> to get this BTL to work for OPA. Maybe this is the problem as the default
>>> libpsm2 won't work.
>>>
>>> So maybe we can fix this in the configure step to detect the version of
>>> libpsm2 and not build if we are not satisfied.
>>>
>>> Another idea is maybe we don't build this BTL by default.
>>> So the user with Cray hardware can still use it if they want. (Just
>>> rebuild with the btl) - We just need to verify if it still works on Cray.
>>> This way, OFI stakeholders do not have to wait until the next major
>>> release to get this in.
>>>
>>>
>>> Arm
>>>
>>>
>>> On Thu, Sep 20, 2018, 7:18 PM Ralph H Castain <r...@open-mpi.org> wrote:
>>> I suspect it is a question of what you tested and in which scenarios.
>>> Problem is that it can bite someone and there isn’t a clean/obvious
>>> solution that doesn’t require the user to do something - e.g., like having
>>> to know that they need to disable a BTL. Matias has proposed an mca-based
>>> approach, but I would much rather we just fix this correctly. Bandaids have
>>> a habit of becoming permanently forgotten - until someone pulls on it and
>>> things unravel.
>>>
>>>
>>>> On Sep 20, 2018, at 4:14 PM, Patinyasakdikul, Thananon <tpati...@vols.utk.edu> wrote:
>>>>
>>>> In the summer, I tested this BTL along with the MTL and was able to use
>>>> both of them interchangeably with no problem. I don't know what changed.
>>>> libpsm2?
>>>>
>>>>
>>>> Arm
>>>>
>>>>
>>>> On Thu, Sep 20, 2018, 7:06 PM Ralph H Castain <r...@open-mpi.org> wrote:
>>>> We have too many discussion threads overlapping on the same email chain -
>>>> so let’s break the discussion on the OFI problem into its own chain.
>>>>
>>>> We have been investigating this locally and found there are a number of
>>>> conflicts between the MTLs and the OFI/BTL stepping on each other. The
>>>> correct solution is to move endpoint creation/reporting into the
>>>> opal/mca/common area, but that is going to take some work and will likely
>>>> impact release schedules.
>>>>
>>>> Accordingly, we propose to remove the OFI/BTL component from v4.0.0, fix
>>>> the problem in master, and then consider bringing it back as a package to
>>>> v4.1 or v4.2.
>>>>
>>>> Comments?
>>>> If we agree, I’ll file a PR to remove it.
>>>> Ralph
>>>>
>>>>
>>>>> Begin forwarded message:
>>>>>
>>>>> From: Peter Kjellström <c...@nsc.liu.se>
>>>>> Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1
>>>>> Date: September 20, 2018 at 5:18:35 AM PDT
>>>>> To: "Gabriel, Edgar" <egabr...@central.uh.edu>
>>>>> Cc: Open MPI Developers <devel@lists.open-mpi.org>
>>>>> Reply-To: Open MPI Developers <devel@lists.open-mpi.org>
>>>>>
>>>>> On Wed, 19 Sep 2018 16:24:53 +0000
>>>>> "Gabriel, Edgar" <egabr...@central.uh.edu> wrote:
>>>>>
>>>>>> I performed some tests on our Omnipath cluster, and I have a mixed
>>>>>> bag of results with 4.0.0rc1
>>>>>
>>>>> I've also tried it on our OPA cluster (skylake+centos-7+inbox) with
>>>>> very similar results.
>>>>>
>>>>>> compute-1-1.local.4351PSM2 has not been initialized
>>>>>> compute-1-0.local.3826PSM2 has not been initialized
>>>>>
>>>>> yup I too see these.
>>>>>
>>>>>> mpirun detected that one or more processes exited with non-zero
>>>>>> status, thus causing the job to be terminated. The first process to
>>>>>> do so was:
>>>>>>
>>>>>> Process name: [[38418,1],1]
>>>>>> Exit code: 255
>>>>>>
>>>>>> ----------------------------------------------------------------------------
>>>>>
>>>>> yup.
>>>>>
>>>>>>
>>>>>> 2. The ofi mtl does not work at all on our Omnipath cluster. If
>>>>>> I try to force it using ‘mpirun --mca mtl ofi …’ I get the following
>>>>>> error message.
>>>>>
>>>>> Yes ofi seems broken. But not even disabling it helps me completely (I
>>>>> see "mca_btl_ofi.so [.] mca_btl_ofi_component_progress" in my
>>>>> perf top...
>>>>>
>>>>>> 3. The openib btl component is always getting in the way with
>>>>>> annoying warnings. It is not really used, but constantly complains:
>>>>> ...
>>>>>> [sabine.cacds.uh.edu:25996] 1 more
>>>>>> process has sent help message
>>>>>> help-mpi-btl-openib.txt / ib port not selected
>>>>>
>>>>> Yup.
>>>>>
>>>>> ...
>>>>>> So bottom line, if I do
>>>>>>
>>>>>> mpirun --mca btl ^openib --mca mtl ^ofi ….
>>>>>>
>>>>>> my tests finish correctly, although mpirun will still return an error.
>>>>>
>>>>> I get some things to work with this approach (two ranks on two nodes
>>>>> for example). But a lot of things crash rather hard:
>>>>>
>>>>> $ mpirun -mca btl ^openib -mca mtl ^ofi ./openmpi-4.0.0rc1/imb.openmpi-4.0.0rc1
>>>>> --------------------------------------------------------------------------
>>>>> PSM2 was unable to open an endpoint. Please make sure that the network
>>>>> link is active on the node and the hardware is functioning.
>>>>>
>>>>> Error: Failure in initializing endpoint
>>>>> --------------------------------------------------------------------------
>>>>> n909.279895hfi_userinit: assign_context command failed: Device or resource busy
>>>>> n909.279895psmi_context_open: hfi_userinit: failed, trying again (1/3)
>>>>> ...
>>>>> PML add procs failed
>>>>> --> Returned "Error" (-1) instead of "Success" (0)
>>>>> --------------------------------------------------------------------------
>>>>> [n908:298761] *** An error occurred in MPI_Init
>>>>> [n908:298761] *** reported by process [4092002305,59]
>>>>> [n908:298761] *** on a NULL communicator
>>>>> [n908:298761] *** Unknown error
>>>>> [n908:298761] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>>> [n908:298761] *** and potentially your MPI job)
>>>>> [n907:407748] 255 more processes have sent help message help-mtl-psm2.txt / unable to open endpoint
>>>>> [n907:407748] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>> [n907:407748] 127 more processes have sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
>>>>> [n907:407748] 56 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
>>>>>
>>>>> If I disable psm2 too I get it to run (apparently on vader?)
>>>>>
>>>>> /Peter K
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel@lists.open-mpi.org
>>>>> https://lists.open-mpi.org/mailman/listinfo/devel