Hi!

We are using DMTCP (version 2.4.4) to checkpoint an MPI application (Open MPI 1.7.5), and we are running into trouble. The operating system is Debian 6.0, and the cluster has Intel Xeon E5405 processors and a Gigabit Ethernet interconnect.

Our main goal is to trigger the checkpoint from inside the application, by including dmtcp.h and calling the function dmtcp_checkpoint(). Both checkpoint and restart work correctly when all MPI processes run on cores of a single node, but we cannot make it work when two nodes are involved.
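
For reference, here is a minimal sketch of this kind of in-application request. It is illustrative only and not our exact code. The helper request_checkpoint() and the CKPT_* values are hypothetical names made up for this example. The sketch assumes that dmtcp.h declares dmtcp_checkpoint() as a weak symbol (so it is NULL when the program does not run under DMTCP), and it repeats that declaration only so the fragment compiles on its own. Having a single rank issue the request is likewise an assumption of the sketch, not something we have confirmed to be required.

```c
#include <stdio.h>

/* Assumption: dmtcp.h declares dmtcp_checkpoint() as a weak symbol, so its
 * address is NULL when the process is not running under DMTCP. The
 * declaration is repeated here only to keep the sketch self-contained;
 * real code would #include "dmtcp.h" instead. */
int dmtcp_checkpoint(void) __attribute__((weak));

/* Hypothetical sentinel values for this sketch; they are not part of DMTCP. */
enum { CKPT_SKIPPED = -1, CKPT_NO_DMTCP = -2 };

/* Hypothetical helper: only the rank equal to `leader` asks for a checkpoint.
 * Returns CKPT_SKIPPED on other ranks, CKPT_NO_DMTCP when DMTCP is absent,
 * and otherwise whatever dmtcp_checkpoint() returns. */
int request_checkpoint(int rank, int leader)
{
    if (rank != leader)
        return CKPT_SKIPPED;

    if (!dmtcp_checkpoint) {
        fprintf(stderr, "rank %d: not running under DMTCP\n", rank);
        return CKPT_NO_DMTCP;
    }

    int rc = dmtcp_checkpoint();
    fprintf(stderr, "rank %d: dmtcp_checkpoint() returned %d\n", rank, rc);
    return rc;
}
```

In an MPI program, the rank would come from MPI_Comm_rank(), and an MPI_Barrier() before the call would bring all ranks to the same point before the request is made.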

We have also run the same test while checkpointing manually from outside (the c command in the coordinator), and in that case checkpoint and restart work correctly on two nodes as well, so the problem appears to be specific to checkpointing from inside the application.

We would appreciate any help in solving this issue. Thank you very much in advance.

Diego

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
