Hi!
We are using DMTCP (version 2.4.4) to checkpoint an MPI application
(Open MPI 1.7.5), and we are running into some trouble. The operating
system is Debian 6.0; the cluster has Intel Xeon E5405 processors and a
Gigabit Ethernet interconnect.
Our main goal is to trigger the checkpoint from inside the
application, by including dmtcp.h and calling the function
dmtcp_checkpoint(). Both checkpoint and restart work correctly when
all MPI processes run on the cores of a single node, but we cannot make
it work when two nodes are involved.
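To make the setup concrete, here is a minimal sketch of how a checkpoint can be requested from inside the application. It is an illustration, not our exact code: HAVE_DMTCP is a hypothetical build flag, ckpt_result_name() and request_checkpoint() are hypothetical helper names, and the fallback definitions are placeholders that only exist so the snippet compiles on a machine without DMTCP. In an MPI program the rank argument would come from MPI_Comm_rank().

```c
#include <stdio.h>

#ifdef HAVE_DMTCP              /* hypothetical build flag, not part of DMTCP */
#include "dmtcp.h"             /* header shipped with DMTCP */
#else
/* Stand-ins so the sketch compiles without DMTCP; values are placeholders. */
#define DMTCP_AFTER_CHECKPOINT 1
#define DMTCP_AFTER_RESTART    2
#define DMTCP_NOT_PRESENT      3
static int dmtcp_checkpoint(void) { return DMTCP_NOT_PRESENT; }
#endif

/* Map the return code of dmtcp_checkpoint() to a readable label. */
static const char *ckpt_result_name(int rc)
{
    switch (rc) {
    case DMTCP_AFTER_CHECKPOINT: return "checkpoint written";
    case DMTCP_AFTER_RESTART:    return "resumed from restart";
    case DMTCP_NOT_PRESENT:      return "not running under DMTCP";
    default:                     return "unexpected return code";
    }
}

/* Request a checkpoint from inside the application and report the outcome.
 * Intended to be called wherever the application wants a checkpoint. */
static int request_checkpoint(int rank)
{
    int rc = dmtcp_checkpoint();
    printf("rank %d: %s\n", rank, ckpt_result_name(rc));
    return rc;
}
```

Logging the return code on every rank makes it easier to see which processes, if any, get past the call when two nodes are involved.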
We have also run the same test while checkpointing manually from
outside (the c command in the coordinator), and in that case checkpoint
and restart work correctly, so the problem seems to be specific to
checkpointing from inside the application.
We would appreciate any help in solving this issue. Thank you very much in advance.
Diego