Re: [Dmtcp-forum] memory overhead at checkpoint

Vakho Tsulaia Thu, 07 Dec 2017 14:07:19 -0800

Hi Kapil,

Thanks for the quick response!

In order to reproduce the problem, I'm afraid you will have to run the full-blown ATLAS Geant4 simulation. We can discuss the details offline if you are interested in doing that.

As for CoW semantics after restart, I would not call it a blocker for us at this point. Potential use-cases for DMTCP are currently being discussed in ATLAS, so I cannot say whether or not we will be using it and, if so, in what use-case scenarios. In any case, it is good to know that this is not a fundamental problem and that a solution can be implemented if necessary.

Thank you,
-- vakho

On 12/7/17 1:04 PM, Kapil Arya wrote:

Hi Vakho,

The 1.4GB to 1.8GB transition is probably due to DMTCP's attempt to read all memory pages, which could potentially result in data being read from files, etc. To identify the culprit in your particular application, it would be helpful if we could reproduce it locally.
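
One way to narrow down which mappings account for the growth, short of a full local reproduction, is to compare per-mapping resident sizes before and after the checkpoint. The following is only a minimal sketch and not part of DMTCP: it assumes Linux, where /proc/<pid>/smaps reports an "Rss:" line for each mapping, and the helper names (rss_by_mapping, rss_growth, read_smaps) are hypothetical.

```python
import re
from collections import defaultdict

# Header line of a mapping in /proc/<pid>/smaps, e.g.
# "7f2c4a000000-7f2c4a021000 rw-p 00000000 00:00 0   [heap]"
_HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+\S+\s+\S+\s+\S+\s*(.*)$")


def rss_by_mapping(smaps_text):
    """Sum the Rss (in kB) of every mapping, grouped by its path or label."""
    totals = defaultdict(int)
    name = None
    for line in smaps_text.splitlines():
        header = _HEADER.match(line)
        if header:
            # Mappings without a path or label are anonymous memory.
            name = header.group(1).strip() or "[anon]"
        elif line.startswith("Rss:") and name is not None:
            totals[name] += int(line.split()[1])
    return dict(totals)


def rss_growth(before, after):
    """Return {mapping: growth in kB} for mappings whose Rss increased."""
    growth = {}
    for name, kb in after.items():
        delta = kb - before.get(name, 0)
        if delta > 0:
            growth[name] = delta
    return growth


def read_smaps(pid):
    """Read the raw smaps text of a process (Linux only)."""
    with open(f"/proc/{pid}/smaps") as f:
        return f.read()
```

Calling read_smaps(pid) once before the checkpoint and once after it, and passing both results through rss_by_mapping and rss_growth, would list the mappings whose resident size increased.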

In terms of CoW semantics after a restart, we had thought about using MADV_MERGEABLE/MADV_SOFT_OFFLINE earlier, but didn't have a critical demand for it. If this is something that's a blocker for you, we would be happy to find a solution.
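
For readers unfamiliar with MADV_MERGEABLE, here is a minimal sketch of what marking a region as mergeable looks like from user space. It is an illustration only and does not show how DMTCP would implement this. It assumes Linux and Python 3.8 or later (for mmap.madvise), and the function names are hypothetical. The advice only has an effect when the kernel is built with KSM support and KSM is switched on, and pages are merged only if their contents are identical.

```python
import mmap


def private_anonymous_region(length):
    """Create a private anonymous mapping, the kind of memory KSM can merge.

    Passing -1 as the file descriptor requests anonymous memory.
    """
    return mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE)


def mark_mergeable(region):
    """Ask the kernel to consider `region` for same-page merging (KSM).

    Returns True if the advice was accepted, False if this platform or
    kernel does not support MADV_MERGEABLE.
    """
    advice = getattr(mmap, "MADV_MERGEABLE", None)
    if advice is None or not hasattr(region, "madvise"):
        return False
    try:
        region.madvise(advice)
    except OSError:
        # Typically EINVAL when the kernel was built without KSM support.
        return False
    return True
```

A return value of True means only that the kernel accepted the hint. Whether any pages are actually merged afterwards depends on the KSM daemon and on the page contents.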


Best,
Kapil

On Thu, Dec 7, 2017 at 3:39 PM, Vakho Tsulaia <vtsul...@lbl.gov> wrote:


    Hello,

    We are using DMTCP to checkpoint the detector simulation program
    of the ATLAS experiment at CERN.
    Before the checkpoint is triggered, the RSS of our application is
    ~1.4GB. When we trigger the checkpoint, it takes DMTCP a few
    seconds to create the checkpoint image on disk. During this time
    the RSS goes up from 1.4GB to 1.8GB. When we restart the
    application, it continues with 1.8GB, which means the restarted
    application uses 400MB more than it would have used without
    checkpoint-restart.

    Shortly after restart the application forks several sub-processes.
    It turns out that this extra 400MB
    does not get shared between sub-processes (which otherwise share
    memory pages thanks to Linux
    Copy-on-Write) and as a result we get NProcesses*400MB memory
    overhead for the entire system.

    Has anyone experienced similar problems? Is there anything we can
    do about it?

    I'm happy to provide more details about our application, if necessary.

    Thank you,
    -- vakho
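
The NProcesses*400MB estimate in the report above can be checked against what the kernel reports for each sub-process. The sketch below is an illustration under stated assumptions, not something from the original thread: it assumes a Linux kernel that provides /proc/<pid>/smaps_rollup (4.14 or later), and the function names are hypothetical. It totals the private and shared resident memory of one process and reproduces the overhead arithmetic.

```python
PRIVATE_FIELDS = {"Private_Clean:", "Private_Dirty:"}
SHARED_FIELDS = {"Shared_Clean:", "Shared_Dirty:"}


def private_and_shared_kb(smaps_rollup_text):
    """Return (private_kb, shared_kb) summed from a smaps_rollup dump."""
    private_kb = 0
    shared_kb = 0
    for line in smaps_rollup_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        if fields[0] in PRIVATE_FIELDS:
            private_kb += int(fields[1])
        elif fields[0] in SHARED_FIELDS:
            shared_kb += int(fields[1])
    return private_kb, shared_kb


def unshared_overhead_mb(extra_mb_per_process, n_processes):
    """Total overhead when every process holds its own copy of the extra pages."""
    return extra_mb_per_process * n_processes
```

For example, with 400MB of unshared extra memory and 8 sub-processes, unshared_overhead_mb(400, 8) gives 3200MB for the whole system.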


    
------------------------------------------------------------------------------
    Check out the vibrant tech community on one of the world's most
    engaging tech sites, Slashdot.org! http://sdm.link/slashdot
    _______________________________________________
    Dmtcp-forum mailing list
    Dmtcp-forum@lists.sourceforge.net
    <mailto:Dmtcp-forum@lists.sourceforge.net>
    https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
    <https://lists.sourceforge.net/lists/listinfo/dmtcp-forum>

