I've spent the past two weeks trying to compile DMTCP for a basic Linux cluster made of Skylake nodes with an InfiniBand EDR interconnect. The OS is CentOS Linux release 7.8.2003, and the cluster uses SLURM as the resource manager.
I have compiled with gcc versions 7 through 9 against multiple versions of MPICH2, MVAPICH2, and Open MPI. For DMTCP itself, I have tried 2.6.2r1 and the main Git repo. I also tried MANA, but the only branch that compiles at all is the "refactor" branch, and the dmtcp_coordinator executable it builds core dumps immediately.

For the other DMTCP source versions, a basic "configure; make; make install" works, but the resulting DMTCP fails to function properly. The two main issues are:

1) After checkpointing, killing 'srun' to stop the running tasks leaves stray processes behind.
2) All attempts to restart a checkpointed job result in hanging tasks and no restart.

I tried passing the '--enable-infiniband-support' flag to configure, but then the infinibandwrapper code fails to build (a failure reported as far back as 2018).

The bugs we are encountering have all been reported, in this forum as well as in the 'issues' section of the Git repo. That makes me wonder: is this project dead as far as MPI tasks go? Is there a page out there with canonical instructions for how to make it work on a Linux cluster?

Thanks.

--
David Gunter
CCS-7
Los Alamos National Laboratory
_______________________________________________ Dmtcp-forum mailing list Dmtcp-forum@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dmtcp-forum