Open MPI Meeting 12/1/2015
--- Attendees --------------
Geoff Paulsen
Jeff Squyres
Geoffroy Valee
Howard
Ryan Grant
Sylvain Jeaugey - new NVIDIA contact (replacing Rolf); previously at Bull Computing for 10 years; lives in Santa Clara.
Todd Kordenbrock
Agenda:
- Solicit volunteer to run the weekly telecon
- Review 1.10
o Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.2
+ One PR remains for 1.10.2 (PR 782).
+ Need someone to clarify this PR so it can be resolved.
+ After we decide it's right, a core developer will need to create the corresponding PR for master.
+ The rest of the PRs are for 1.10.3 (March or April 2016?).
o When do we want to start release work for 1.10.2?
+ How about a 1.10.2 Release Candidate before the holidays?
+ Ralph will send email about this to dev list to solicit discussion.
- Review 2.x
o Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20
o Blocker issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker
+ 1064 - Ralph / Jeff, is this doable by December?
+ Dynamic add_procs is now broken when the value is set to 0 (not related to PMIx).
o Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0
+ One of us will go through ALL issues for 2.0.0 and ask whether they can be moved out to a future release.
o RFC on embedded PMIx version handling
+ Once PMIx is stabilized, treat it just like hwloc or libevent.
+ PMIx would have separate releases, and Open MPI could cherry-pick specific releases as needed.
+ When PMIx has a new release, we'll create a new directory for it, validate it, and remove the older ones once it's ready to go.
+ Ralph will create a PMIx tarball for that.
+ PMIx needs to be in 2.0.0; we need to update it and move to the right naming convention while doing so.
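As a concrete illustration of the proposal (the directory names below are hypothetical, modeled on how hwloc and libevent are already embedded as versioned directories), each imported PMIx release would live side by side until validated:

```shell
# Hypothetical source-tree layout under the proposed convention,
# mirroring the existing hwloc/libevent embedding style:
#
#   opal/mca/pmix/
#   |-- pmix111/     # currently embedded PMIx release (tarball contents)
#   |-- pmix120/     # newly imported PMIx release, validated in parallel
#   `-- external/    # component for building against a system-installed PMIx
#
# Once the new versioned directory passes validation, the older one is removed.
```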
o RFC process - Every so often there are 'big deal' issues / PR requests. It's hard to spot these BIG ones.
+ Ralph proposing that if you're making a major change or change to core-code:
Send RFC to devel list before you do it! (and again with PR when it's ready, put "RFC" in PR title.)
+ Good idea to send out the RFC before you start, so others can give a heads-up or comment.
+ Prevent potential conflicts of parallel development.
Howard - Nice to list the affected components and the reason for wanting the change.
Jeff - We had a nice format for RFCs before, with short / long versions. Might want to nail that down.
Jeff - Propose we put "RFC" in the PR title.
Jeff - Should the body and format be in the PR?
Discussion about proposed work should happen on the devel email list.
Discussion about already-written code happens on the PR.
Jeff proposes a wiki page describing this process:
Where - what does it affect?
When - when can we discuss? Give at least 1 week for others to reply.
What - summary.
Why - some justification, better than "I was bored".
Deeper discussion goes down below.
o Supercomputing reports
+ OMPI BoF went well. Over 100 people in the room. Slides are on the OMPI website and on Jeff's blog.
+ People appreciated the BoF format of "status, roadmap, what's going well, what needs more attention, etc"
+ The PMIx BoF went well too. Scaling improvements went REALLY well.
+ PMIx showed a really good scaling slope; they thought the remaining cost was the wire-up time of the daemons.
Mellanox still needs to remove the LID/GID exchange requirement, but that's likely a year out.
o Status Update: Mellanox, Sandia, Intel
+ Mellanox (via Ralph)
1. Artem will be working with Ralph et al. to finish off the OMPI side issues in PMIx integration.
2. Igor Ivanov will continue to fix memory corruption bugs uncovered in Valgrind.
3. Artem and Igor will start looking at making the necessary changes to UCX PML to use the direct modex.
4. Mellanox plans to submit UCX PML for inclusion in 1.10.3.
5. Mellanox plans to submit missing routines needed for OSHMEM 1.2 spec compliance for inclusion in 1.10.3. Igor Ivanov will be leading this.
+ Sandia (Ryan Grant)
- Put Portals triggered ops on master. Will run tests there for a while and then submit a PR for the 2.0 branch.
+ Intel (Ralph)
- PMIx: working on pull requests.
- OpenHPC work is occupying a lot of his time. Intel is announcing OpenHPC to create a community distribution optimized for HPC.
- Building on top of OPAL.
o Howard has a request for Sylvain / NVIDIA
+ Sylvain stopped Rolf's MTT runs yesterday; hoping to have them back by the end of the week.
+ MLX5 HCAs - master shows lots of errors; not sure whether they're caused by software.
+ The NVIDIA cluster tends to surface bugs before other clusters do.
+ Right now master runs really clean under the defaults, but turning on dynamic add_procs shows lots of issues in MPI_Comm_dup and other communicator-creation code.
--- Status Update Rotation ----
LANL, Houston, HLRS, IBM
Cisco, ORNL, UTK, NVIDIA
Mellanox, Sandia, Intel