I’ve spent the past two weeks trying to compile DMTCP for a basic Linux cluster 
made of Skylake nodes with an InfiniBand EDR interconnect. The OS is CentOS 
Linux release 7.8.2003, and the cluster uses SLURM as the resource manager.

I have compiled with GCC versions 7 through 9, against MPICH2, MVAPICH2, and 
Open MPI, in multiple versions.

As far as DMTCP versions go, I have tried 2.6.2r1 and the main Git repo. I also 
tried the MANA version; the only branch that compiles at all is the “refactor” 
branch, and the dmtcp_coordinator executable it builds dumps core immediately.

For the other DMTCP source versions, a basic “configure; make; make install” 
works, but the resulting DMTCP fails to function properly. The two main issues 
are 1) killing the running tasks after a checkpoint (by killing ‘srun’) leaves 
stray processes behind, and 2) all attempts to restart a checkpointed job 
result in hanging tasks and no restart.

I tried passing the ‘--enable-infiniband-support’ flag to configure, but then 
the infinibandwrapper code fails to build (a failure reported as far back as 
2018).
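
To be concrete, the two build variants mentioned above look roughly like the 
sketch below. It is not a transcript of the exact commands that were run. The 
`run` wrapper, the `build_dmtcp` function, and the `DRY_RUN` variable are 
illustrative additions so the script only prints the commands unless it is 
told otherwise. Only the configure flag itself comes from the description 
above.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 from inside a DMTCP source tree to really run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: the basic "configure; make; make install" sequence.
# Any arguments are passed through to configure as extra flags.
build_dmtcp() {
    run ./configure "$@" &&
    run make &&
    run make install
}

# Variant 1: plain build (completes, but checkpoint/restart misbehaves).
build_dmtcp

# Variant 2: InfiniBand support requested (the infinibandwrapper build fails).
build_dmtcp --enable-infiniband-support
```

Stopping at the first failing step (the `&&` chain) makes it easy to see 
whether a given variant breaks in configure, make, or make install.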

The bugs we are encountering have all been reported, both in this forum and in 
the ‘issues’ section of the Git repo. That makes me wonder: is this project 
dead as far as MPI tasks go? Is there a page out there with canonical 
instructions for how to make it work on a Linux cluster?

Thanks.
--
David Gunter
CCS-7
Los Alamos National Laboratory





