Hello interested parties,

As part of the work on the accelerator framework, the non-standard behavior of the existing CUDA code in Open MPI is being reworked. One of the proposed changes affects how CUDA components are compiled and linked.
Currently, CUDA functions are loaded dynamically using dlopen and stored in a function pointer table, with some code that searches typical paths to locate libcuda. This means that we can configure Open MPI with --with-cuda=/path/to/cuda and the resulting build should work in both CUDA and non-CUDA environments.

The change we are making removes the function pointer table and instead gives the relevant components a direct dependency on libcuda. This is in line with the rest of Open MPI's MCA system, where you can build components as DSOs. The differences are that Open MPI will call libcuda functions directly, and that components with a CUDA dependency will be built as DSOs (i.e. --with-cuda=/path/to/cuda/ --enable-mca-dso=accelerator-cuda). During dynamic linking at load time, these DSOs may fail to load, such as in a non-CUDA environment, but this won't prevent Open MPI from functioning. Related work (https://github.com/open-mpi/ompi/pull/10763) to add an option that silences the warnings produced in this expected path is also in progress.

From a user's perspective, nothing changes. For compilation, dependent components will need to be built as DSOs. In the code, we can remove the dlopen dependency for CUDA builds, standardize the CUDA code with the rest of Open MPI, and remove the code that stores function pointers and detects the libcuda location.

Please provide feedback if you have any suggestions or are against these changes.

Thanks,
William Zhang
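
For readers less familiar with the pattern being removed, here is a minimal sketch of a dlopen-based function pointer table. It is not Open MPI's actual code: the names cuda_table_t and cuda_table_load are hypothetical, the list of candidate library names is an assumption, and int stands in for CUresult so that no CUDA headers are needed. Only cuInit is looked up, as a representative libcuda entry point.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical function pointer table (illustration only, not Open MPI's).
 * int stands in for CUresult so the sketch builds without CUDA headers. */
typedef struct {
    void *handle;                       /* handle returned by dlopen */
    int (*cuInit)(unsigned int flags);  /* resolved at run time via dlsym */
} cuda_table_t;

/* Assumed candidate names; a real search would also try typical paths. */
static const char *cuda_candidates[] = { "libcuda.so.1", "libcuda.so" };

/* Returns 0 if libcuda and cuInit were found, -1 otherwise.
 * On failure the table is left with NULL members. */
static int cuda_table_load(cuda_table_t *t)
{
    size_t i;

    assert(t != NULL);
    t->handle = NULL;
    t->cuInit = NULL;

    /* Try each candidate name until one loads. */
    for (i = 0; i < sizeof(cuda_candidates) / sizeof(cuda_candidates[0]); i++) {
        t->handle = dlopen(cuda_candidates[i], RTLD_LAZY);
        if (t->handle != NULL) {
            break;
        }
    }
    if (t->handle == NULL) {
        return -1;  /* non-CUDA environment: caller disables CUDA support */
    }

    /* Cast form recommended by POSIX for storing a dlsym result
     * in a function pointer. */
    *(void **)(&t->cuInit) = dlsym(t->handle, "cuInit");
    if (t->cuInit == NULL) {
        dlclose(t->handle);
        t->handle = NULL;
        return -1;
    }
    return 0;
}
```

On older glibc versions this needs -ldl at link time. Under the proposed change, a component would instead include the CUDA headers, call cuInit directly, and link against libcuda. The "library not found" case is then handled when the component DSO fails to load, not by code like the above.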