Hi Kapil,
Thanks for the quick response!
To reproduce the problem, I'm afraid you would have to run the
full-blown ATLAS Geant4 simulation. We can discuss the details offline
if you are interested in doing that.
As for CoW semantics after restart, I would not call it a blocker for us
at this point. Potential use cases for DMTCP are currently being
discussed in ATLAS, so I cannot say whether we will be using it and, if
so, in which scenarios. In any case, it is good to know that this is not
a fundamental problem and that a solution can be implemented if
necessary.
Thank you,
-- vakho
On 12/7/17 1:04 PM, Kapil Arya wrote:
Hi Vakho,
The 1.4GB to 1.8GB transition is probably due to DMTCP's attempt to read
all memory pages, which could potentially result in data being read from
files, etc. To identify the culprit in your particular application, it
would be helpful if we could reproduce it locally.
In terms of CoW semantics after a restart, we had thought about using
MADV_MERGEABLE/MADV_SOFT_OFFLINE earlier, but didn't have a critical
demand for it. If this is a blocker for you, we would be happy to find
a solution.
Best,
Kapil
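Editorial sketch for the MADV_MERGEABLE idea mentioned above: a minimal, Linux-only illustration of marking a private anonymous region as a candidate for kernel same-page merging (KSM). This is not DMTCP code and does not describe how DMTCP would implement it. The helper name mark_mergeable is hypothetical, and the call only has an effect if KSM is enabled on the host (/sys/kernel/mm/ksm/run set to 1).

```python
import mmap


def mark_mergeable(region: mmap.mmap) -> bool:
    """Ask the kernel to consider this mapping for KSM page merging.

    Hypothetical helper for illustration. Returns True if the madvise()
    call succeeded, False if the platform or kernel does not support it.
    """
    option = getattr(mmap, "MADV_MERGEABLE", None)  # Linux-only constant
    if option is None or not hasattr(region, "madvise"):  # Python < 3.8
        return False
    try:
        region.madvise(option)
    except OSError:
        # For example EINVAL when the kernel was built without CONFIG_KSM.
        return False
    return True


if __name__ == "__main__":
    size = 16 * mmap.PAGESIZE
    # KSM only considers private anonymous pages, so request MAP_PRIVATE.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    region.write(b"\0" * size)  # touch the pages so they are resident
    print("marked mergeable:", mark_mergeable(region))
    region.close()
```

A successful call only registers the region with KSM. Whether identical pages actually get merged depends on the ksmd scanner and its settings.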
On Thu, Dec 7, 2017 at 3:39 PM, Vakho Tsulaia <vtsul...@lbl.gov> wrote:
Hello,
We are using DMTCP to checkpoint the detector simulation program of the
ATLAS experiment at CERN.
Before the checkpoint is triggered, the RSS of our application is
~1.4GB. When we trigger the checkpoint, it takes DMTCP a few seconds to
write the checkpoint image to disk. During this time the RSS goes up
from 1.4GB to 1.8GB. When we restart the application, it continues with
1.8GB, which means the restarted application uses 400MB more than it
would have used without checkpoint-restart.
Shortly after restart the application forks several sub-processes. It
turns out that this extra 400MB is not shared between the sub-processes
(which otherwise share memory pages thanks to Linux Copy-on-Write), and
as a result we get NProcesses*400MB of memory overhead for the entire
system.
Has anyone experienced similar problems? Is there anything we can do
about it?
I'm happy to provide more details about our application if necessary.
Thank you,
-- vakho
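Editorial sketch: one way to check how much of a process's memory is private rather than shared with its siblings is to sum the per-mapping counters in /proc/<pid>/smaps on Linux. The function names below are hypothetical and the script is not part of DMTCP or the ATLAS software. It only illustrates the kind of measurement behind the NProcesses*400MB estimate above.

```python
from pathlib import Path

# Per-mapping counters reported (in kB) by /proc/<pid>/smaps on Linux.
FIELDS = ("Rss", "Pss", "Shared_Clean", "Shared_Dirty",
          "Private_Clean", "Private_Dirty")


def summarize_smaps(text: str) -> dict:
    """Sum the selected smaps counters over all mappings; values are in kB."""
    totals = {name: 0 for name in FIELDS}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in totals:
            parts = rest.split()
            if parts and parts[0].isdigit():
                totals[key] += int(parts[0])
    return totals


def private_kb(totals: dict) -> int:
    """Memory not shared with any other process (for example after a fork)."""
    return totals["Private_Clean"] + totals["Private_Dirty"]


if __name__ == "__main__":
    smaps = Path("/proc/self/smaps")  # replace "self" with a worker PID
    if smaps.exists():
        totals = summarize_smaps(smaps.read_text())
        print(totals)
        print("private kB:", private_kb(totals))
```

Comparing the private total of each forked worker before and after a checkpoint-restart cycle would show whether the extra pages are duplicated per process. Pss divides shared pages among the processes mapping them, so summing Pss over all workers gives a system-wide figure without double counting.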
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum