Hello,

We are using DMTCP to checkpoint the detector simulation program of the ATLAS experiment at CERN. Before the checkpoint is triggered, the RSS of our application is ~1.4GB. When we trigger the checkpoint, it takes DMTCP a few seconds to write the checkpoint image to disk. During this time the RSS grows from 1.4GB to 1.8GB. When we restart the application, it continues with 1.8GB, which means the restarted application uses 400MB more than it would have used without checkpoint-restart.
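
For reference, a minimal sketch of how RSS figures like the ones above could be sampled around a checkpoint, assuming a Linux /proc filesystem. The helper names (parse_vmrss_kb, read_rss_kb) are hypothetical and not part of DMTCP.

```python
import re


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status.

    Returns None if the field is missing (e.g. for kernel threads).
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid="self"):
    """Read the current RSS of a process in kB (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmrss_kb(status_file.read())
```

Calling read_rss_kb(pid) before the checkpoint, while the image is being written, and after restart would give the three numbers to compare.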

Shortly after restart, the application forks several sub-processes. It turns out that this extra 400MB is not shared between the sub-processes (which otherwise share memory pages thanks to Linux copy-on-write), and as a result we get a memory overhead of NProcesses*400MB for the entire system.
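
A minimal sketch of how the shared versus private split could be checked per sub-process, assuming a Linux kernel that exposes /proc/<pid>/smaps. The function names are hypothetical; only the smaps field names come from the kernel.

```python
# Fields reported per mapping in /proc/<pid>/smaps, all in kB.
FIELDS = ("Rss", "Pss", "Shared_Clean", "Shared_Dirty",
          "Private_Clean", "Private_Dirty")


def summarize_smaps(smaps_text):
    """Sum the selected smaps fields over all mappings of one process."""
    totals = dict.fromkeys(FIELDS, 0)
    for line in smaps_text.splitlines():
        key, sep, rest = line.partition(":")
        # Mapping header lines and other fields are skipped here,
        # because their key is not one of FIELDS.
        if sep and key in totals:
            totals[key] += int(rest.split()[0])
    return totals


def private_kb(totals):
    """Memory that is not shared with any other process, in kB."""
    return totals["Private_Clean"] + totals["Private_Dirty"]


def read_smaps_summary(pid="self"):
    """Summarize /proc/<pid>/smaps for a running process (Linux only)."""
    with open(f"/proc/{pid}/smaps") as smaps_file:
        return summarize_smaps(smaps_file.read())
```

If the extra 400MB shows up under the private fields in every sub-process rather than under the shared ones, the system-wide cost scales with the number of sub-processes, which matches the NProcesses*400MB estimate.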

Has anyone experienced similar problems? Is there anything we can do about it?

I'm happy to provide more details about our application, if necessary.

Thank you,
-- vakho


_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
