FYI - for those not on the PMIx mailing list, we would welcome your input.

Ralph
Begin forwarded message:

From: Ralph Castain <r...@open-mpi.org>
Subject: PMIx 2.0 API thoughts
Date: August 22, 2015 at 8:45:50 AM PDT
To: pmix-de...@open-mpi.org

Hi folks

At the last PMIx telecon, people asked about the current status of suggested PMIx 2.0 plans. I confess that we have been so absorbed by integrating PMIx 1.x into OMPI/ORCM and SLURM that we haven’t developed these plans as much as I had hoped. However, we have been collecting input from around the community and have assembled some initial thoughts.

The 2.0 effort can be broken into two general tracks:

Performance Enhancements

* Reduced memory footprint. We currently store a lot of information in every process - the per-proc footprint isn’t that huge, but at large numbers of procs/node the duplicated space becomes significant. Courtesy of Elena (Mellanox), we already have a shared memory implementation poised in the 2.0 branch that will allow the local PMIx server to create one instance of the info and then just point the local procs to it (a rough sketch of the idea appears after this list). This will undoubtedly benefit from some refinement as we begin to fully utilize it.

* Data scoping. The scoping feature included in the PMIx_Put function provides several levels of locality, which reduces the amount of data being distributed across nodes and stored on each node (see the example after this list). Current programming model libraries other than OMPI don’t exploit this feature, and we need to educate them on its use.

* Distributed approach to database organization. If we consider the publish/lookup data as well as the typical modex data, what we really have is a traditional key-value datastore. Fault tolerance requirements are beginning to be reflected in this datastore, which raises questions about where and how we store it. There has been a lot of research on the best ways of storing and retrieving such data, and the current approach isn’t necessarily high on the list of “best” fault tolerant solutions. We therefore might want to spend some time looking at distributed datastore methods such as DHTs, including them in PMIx as user-selectable options until we better understand their impact on performance. This will mean adding appropriate infrastructure to PMIx to support such research.

* Embedded collective algorithms for high-performance and scalable barrier/allgather operations. We do not envision PMIx having inter-nodal communication capabilities, but it is possible that we could provide APIs by which the host could direct scalable collectives across the PMIx servers. These could range from traditional HPC algorithms to Adam Moody’s new “ring” model - we should let the user direct the choice and provide a default automated selection method.

* Assuming we provide the above API, that would open the question of providing a complete standalone server that uses the provided communication functions to execute inter-nodal operations. We would need to somehow wrap the other server APIs so the host RM could do things like “register_nspace”, but this might make adoption easier for those preferring something standalone.

* Switching algorithms for full vs direct modex operations. PMIx provides the ability to either fully exchange all data during the “modex” operation, or to execute a zero-byte barrier (or even a “no-op”), with data retrieval done on an “as-requested” basis (what we call “direct modex”). Deciding which of these algorithms to use is currently left to the caller, which means it is hardwired into the programming model library. However, the “optimal” decision is probably a question of both scale and volume of data, so some selection logic might be appropriate. We could either embed the decision in PMIx, or at least provide an API to help users pick the best option based on provided info (the fence example after this list shows how the two modes are selected today).
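For those curious, here is a rough sketch of the shared-memory idea from the first bullet. This is a generic POSIX illustration, not Elena’s actual dstore code; the segment name, size, and layout are invented for the example, and error checks are omitted:

    /* server side: create one instance of the job info in a shared
     * memory segment, instead of copying it into every local proc */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SEG_NAME "/pmix-dstore-example"  /* invented name */
    #define SEG_SIZE 4096                    /* invented size */

    static void *server_publish(const char *blob, size_t len)
    {
        int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
        ftruncate(fd, SEG_SIZE);
        void *base = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        memcpy(base, blob, len);
        close(fd);
        return base;
    }

    /* client side: map the same segment read-only, so every local
     * proc shares the one copy */
    static const void *client_attach(void)
    {
        int fd = shm_open(SEG_NAME, O_RDONLY, 0);
        const void *base = mmap(NULL, SEG_SIZE, PROT_READ,
                                MAP_SHARED, fd, 0);
        close(fd);
        return base;
    }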
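And a minimal client fragment showing the scoping levels on PMIx_Put. This assumes the current 1.x signatures (PMIx_Init taking just the proc, plus PMIx_Put and PMIx_Commit); the keys and values are invented, and error handling is trimmed:

    #include <pmix.h>

    void publish_endpoints(void)
    {
        pmix_proc_t myproc;
        pmix_value_t val;

        PMIx_Init(&myproc);  /* 1.x signature */

        /* PMIX_LOCAL: visible only to procs on the same node */
        val.type = PMIX_STRING;
        val.data.string = "shmem-endpoint-info";
        PMIx_Put(PMIX_LOCAL, "ex.local.key", &val);

        /* PMIX_REMOTE: visible only to procs on other nodes
         * (PMIX_GLOBAL would mark the data for everyone) */
        val.data.string = "fabric-endpoint-info";
        PMIx_Put(PMIX_REMOTE, "ex.remote.key", &val);

        /* nothing moves until the data is committed */
        PMIx_Commit();
    }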
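The caller-directed choice in the last bullet boils down to how the fence is invoked - roughly as below, assuming the PMIX_COLLECT_DATA attribute and the 1.x PMIx_Fence signature (NULL/0 conventionally meaning all procs in the caller’s namespace):

    #include <pmix.h>
    #include <string.h>

    /* full modex: collect everyone's posted data during the fence */
    pmix_status_t full_modex(void)
    {
        pmix_info_t info;
        memset(&info, 0, sizeof(info));
        (void)strncpy(info.key, PMIX_COLLECT_DATA, PMIX_MAX_KEYLEN);
        info.value.type = PMIX_BOOL;
        info.value.data.flag = true;
        return PMIx_Fence(NULL, 0, &info, 1);
    }

    /* direct modex: zero-byte barrier now, data retrieved later
     * on demand via PMIx_Get */
    pmix_status_t direct_modex(void)
    {
        return PMIx_Fence(NULL, 0, NULL, 0);
    }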
Functional Extensions

These are focused on application-level APIs for interacting with the RM/system mgmt, based on input from various users as well as the recent CORAL RFP. Also, keep in mind that multi-tenant operations are likely to become more common as we see continued increases in the number of cores on nodes plus convergence of HPC with cloud environments. So the RM will likely become involved in managing an expanding range of resources (see http://www.open-mpi.org/papers/controls-2015/ for at least one vision of how this might look), which means applications will need to interact with it even more.

* Application-directed workflow execution. I’ve been getting increased requests for the ability to support emerging programming models where the application “steers” execution (e.g., Hadoop, Spark, Legion, and Radical). In these models, the application starts as a single process, computes its resource needs (nodes, files, etc.), executes some operation that involves spawning new processes, evaluates the results, and iterates. We currently have the “spawn” API (see the sketch after this list), but additional support will be required. I suspect Gary’s SC15 BoF will help to identify those needs.

* Request changes in power policy and settings. This is still an emerging area of interest, and really only on the largest machines, but providing a method for doing this would be helpful.

* Direct positioning of files for use by the application or another job step within the same allocated session. As on-node storage increases (e.g., with NVRAM), opportunities exist for pre-positioning files and for retaining files and/or shared memory regions across job steps within the same allocated session. This needs to be supported as part of a workflow script as well as via a programmatic API (for dynamic direction by the application) - if we provide a standard library for this purpose, then even the cmd line becomes easy to support. This would also include specifying storage policies such as hot/warm/cold locations, and requests for burst buffer management.

* Request notification of events at the application and/or system level, including warning of predicted failures for preemptive response, notification of process failures, and other events of interest to the application. I’ve received requests for a laundry list of events that various applications would like to be notified about, so I think generalizing the existing notification API makes the most sense. Currently, we only support registration of a single notifier callback. This probably needs to be extended to allow registration of an arbitrary number of callbacks, each for a given event (or combination of events), with multiple callbacks for the same event invoked in some kind of defined progression (i.e., call this function first, then that one next - maybe with a “stop the callback stack” return option). A hypothetical sketch of such a dispatch scheme also follows this list.

* Define notification response actions, including allocation of replacement resources and launch of replacement processes. All of the specific operations will probably be supported via other APIs, but we need a way to tell the host RM that the application is indeed taking responsive action, so please don’t do anything that would interfere with that action (e.g., terminate the job).

* Request dynamic modification of allocations, including expansion and/or partial release of the existing allocation, and new allocations for subsequent spawn requests.

* Request fabric QoS and security constraints, plus information on network topology.
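To make the steering loop concrete, here is a hedged illustration of one iteration using the existing spawn API. The “./worker” command and the helper name are invented; the PMIx_Spawn call and the PMIX_APP_CREATE/PMIX_APP_FREE macros are the current ones:

    #include <pmix.h>
    #include <string.h>

    /* one iteration of a steering loop: spawn nworkers copies of a
     * worker job, returning the child job's namespace in `nspace` */
    int spawn_workers(int nworkers, char nspace[])
    {
        pmix_app_t *app;
        pmix_status_t rc;

        PMIX_APP_CREATE(app, 1);
        app->cmd = strdup("./worker");  /* invented executable */
        app->maxprocs = nworkers;       /* resource need the app computed */

        rc = PMIx_Spawn(NULL, 0, app, 1, nspace);
        PMIX_APP_FREE(app, 1);
        return (PMIX_SUCCESS == rc) ? 0 : -1;
    }

The application would then evaluate the results from that namespace and loop back to compute its next set of resource needs.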
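Finally, a purely hypothetical sketch of what the proposed callback stack for event notification might look like - none of these names exist in PMIx today, this is just the shape of the idea:

    #include <stddef.h>

    typedef enum {
        EVT_CONTINUE,  /* pass the event to the next handler */
        EVT_STOP       /* the "stop the callback stack" option */
    } evt_action_t;

    typedef evt_action_t (*evt_handler_fn_t)(int event, void *cbdata);

    typedef struct {
        int event;             /* event (or combination) of interest */
        evt_handler_fn_t fn;
        void *cbdata;
    } evt_registration_t;

    /* deliver an event through the registered handlers in their
     * defined progression, honoring the stop-the-stack return */
    static void evt_dispatch(evt_registration_t *regs, size_t nregs,
                             int event)
    {
        size_t i;
        for (i = 0; i < nregs; i++) {
            if (regs[i].event == event &&
                EVT_STOP == regs[i].fn(event, regs[i].cbdata)) {
                break;
            }
        }
    }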
Obviously it is going to take significant time to cover all these areas, and I honestly don’t expect to see ALL of them in the 2.x series. This is an open source community project, so the timing and ordering of the features will depend on the interests of the developers, influenced as always by feedback from users.

So please feel free to “volunteer” to contribute!

Ralph