Hi Kapil,

Thanks for the quick response!

To reproduce the problem, I'm afraid you would have to run the full-blown ATLAS Geant4 simulation. We can discuss the details offline if you are interested in doing that.

As for CoW semantics after restart, I would not call it a blocker for us at this point. Potential use cases for DMTCP are currently being discussed in ATLAS, so I cannot yet say whether we will be using it and, if so, in which scenarios. In any case, it is good to know that this is not a fundamental problem and that a solution can be implemented if necessary.

Thank you,
-- vakho

On 12/7/17 1:04 PM, Kapil Arya wrote:
Hi Vakho,

The 1.4GB to 1.8GB transition is probably due to DMTCP attempting to read all memory pages, which could potentially result in data being read in from files, etc. To identify the culprit in your particular application, it would be helpful if we could reproduce it locally.

In terms of CoW semantics after a restart, we had thought about using MADV_MERGEABLE/MADV_SOFT_OFFLINE earlier, but didn't have a critical demand for it. If this is something that's a blocker for you, we would be happy to find a solution.
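For readers unfamiliar with MADV_MERGEABLE, here is a minimal sketch of what the flag does. It is not DMTCP code and says nothing about how DMTCP would use it; the helper name `mark_mergeable` is hypothetical. It assumes Linux and Python 3.8 or later, and it only asks the kernel to consider the region for KSM (kernel samepage merging). Actual merging also requires KSM to be built into the kernel and enabled through /sys/kernel/mm/ksm/run.

```python
import mmap


def mark_mergeable(size):
    """Create a private anonymous mapping and advise the kernel that its
    pages may be merged by KSM.

    Returns (region, accepted). `accepted` is False when the platform or
    kernel does not support MADV_MERGEABLE; the mapping is still usable.
    Hypothetical helper for illustration only.
    """
    # KSM only considers private anonymous memory, so request MAP_PRIVATE
    # explicitly (Python's default for mmap is a shared mapping).
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

    advice = getattr(mmap, "MADV_MERGEABLE", None)  # Linux-only constant
    if advice is None:
        return region, False
    try:
        region.madvise(advice)  # applies to the whole mapping
    except OSError:
        # For example EINVAL when the kernel was built without KSM.
        return region, False
    return region, True


region, accepted = mark_mergeable(16 * mmap.PAGESIZE)
print("MADV_MERGEABLE accepted:", accepted)
```

Pages merged by KSM are shared copy-on-write, so identical pages across processes are stored once until one of them is written to.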

Best,
Kapil

On Thu, Dec 7, 2017 at 3:39 PM, Vakho Tsulaia <vtsul...@lbl.gov <mailto:vtsul...@lbl.gov>> wrote:

    Hello,

    We are using DMTCP to checkpoint the detector simulation program
    of the ATLAS experiment at CERN.
    Before the checkpoint is triggered, the RSS of our application is
    ~1.4GB. When we trigger the checkpoint,
    it takes DMTCP a few seconds to create the checkpoint image on disk.
    During this time the RSS goes up
    from 1.4GB to 1.8GB. When we restart the application, it continues
    with 1.8GB, which means the
    restarted application uses 400MB more than it would have used
    without checkpoint-restart.

    Shortly after restart the application forks several sub-processes.
    It turns out that this extra 400MB
    does not get shared between sub-processes (which otherwise share
    memory pages thanks to Linux
    Copy-on-Write) and as a result we get NProcesses*400MB memory
    overhead for the entire system.
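One way to check how much of a process's memory is actually shared with its siblings is to sum the per-mapping counters in /proc/<pid>/smaps. The sketch below is only an illustration: the function name `summarize_smaps` and the sample text are made up, and it assumes the usual Linux smaps layout in which each counter line reads "Name:   value kB".

```python
def summarize_smaps(text):
    """Sum selected per-mapping counters (in kB) from smaps-formatted text."""
    totals = {
        "Rss": 0,
        "Shared_Clean": 0,
        "Shared_Dirty": 0,
        "Private_Clean": 0,
        "Private_Dirty": 0,
    }
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        # Mapping header lines (address ranges) never match a counter name.
        if sep and key in totals:
            fields = rest.split()
            if fields:
                totals[key] += int(fields[0])
    return totals


# Made-up sample with one shared file mapping and one private anonymous mapping.
sample = """\
00400000-0048a000 r-xp 00000000 fd:03 960637 /bin/bash
Rss:                 892 kB
Shared_Clean:        892 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
7f0000000000-7f0000100000 rw-p 00000000 00:00 0
Rss:                1024 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      1024 kB
"""

print(summarize_smaps(sample))
```

On a live system the input would be the contents of /proc/<pid>/smaps for each sub-process. Memory that is private to every child, rather than shared, is what multiplies with the number of processes.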

    Has anyone experienced similar problems? Is there anything we can
    do about it?

    I'm happy to provide more details about our application, if necessary.

    Thank you,
    -- vakho


    
------------------------------------------------------------------------------
    Check out the vibrant tech community on one of the world's most
    engaging tech sites, Slashdot.org! http://sdm.link/slashdot
    _______________________________________________
    Dmtcp-forum mailing list
    Dmtcp-forum@lists.sourceforge.net
    <mailto:Dmtcp-forum@lists.sourceforge.net>
    https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
    <https://lists.sourceforge.net/lists/listinfo/dmtcp-forum>


