Hi Eliot and Kapil,

Kapil,
    If you know the answer to this off the top of your head, could you
chime in?  It will save me searching through the DMTCP source code.

Eliot reports that the Java jar files are automatically being saved
as open files by DMTCP.  He has not used the --ckpt-open-files flag.

I had thought that we saved by default only files opened for writing for
which the file offset was in the middle of the file.  As Eliot pointed
out, the offset within a Java jar file is likely to be in the middle
of the file.  But Java jar files are opened read-only.
    Could it be that DMTCP is also saving any file whose offset is
in the middle of the file _even though it is opened read-only_?
    If so, we should re-think this issue, now that we are formally
documenting design issues under "issues" on github.
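
As a rough sketch of the default policy as it is described in this thread
(not DMTCP's actual code): the function name, its parameters, and the reading
of "middle of the file" as strictly between 0 and the file size are all
assumptions made for illustration.

```python
import os


def should_copy_by_default(open_flags: int, offset: int, size: int,
                           force_copy: bool = False) -> bool:
    """Hypothetical model of the policy discussed above; not DMTCP code.

    open_flags : flags the file was opened with (as passed to open(2))
    offset     : current file offset
    size       : current file size in bytes
    force_copy : models the effect of the command-line flag that forces
                 copies of open files
    """
    if force_copy:
        return True
    access_mode = open_flags & os.O_ACCMODE
    writable = access_mode in (os.O_WRONLY, os.O_RDWR)
    # Assumption: "middle of the file" means strictly inside the file.
    in_middle = 0 < offset < size
    return writable and in_middle
```

Under this model a read-only jar file would never be copied, whatever its
offset.  If DMTCP copies it anyway, the real condition must differ from the
one sketched here.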

Eliot,
    As for the mysterious behavior under "Grid Engine", we're
interested.  We've never used Grid Engine ourselves.  Mostly,
we've worked with SLURM, Torque, and ibrun, and with LSF (not currently supported),
when it comes to resource managers and DMTCP.

Thanks,
- Gene

----- Original Message -----
From: Eliot Moss <m...@cs.umass.edu>
To: Gene Cooperman <g...@ccs.neu.edu>
Cc: dmtcp-forum@lists.sourceforge.net
Sent: Wed, 13 May 2015 00:01:03 -0400 (EDT)
Subject: Re: [Dmtcp-forum] Three things

On 5/12/2015 4:49 PM, Gene Cooperman wrote:

> 1) We still owe you on the restart issue of 2/26/15.  It's surprisingly
>    subtle.

Thanks for keeping at it.

> 2) "Am I right in guessing that a file on a remote drive is identified
>     in part by the uuid of the local mount point, and in part by information
>     about the remote drive?"
>
>     I believe that DMTCP knows nothing about local or remote drives or
>     mount points.  It should view files as part of a single unified
>     filesystem.

Ok, so here's a model of what might be happening.  DMTCP notes the _device_ a
file comes from.  When a system is rebuilt, the device looks different.  All we
need is one open file on that device to mess up the process.
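
To make the model above concrete, here is a minimal sketch of how the
operating system itself identifies the device a file lives on.  It does not
show what DMTCP records (that is exactly the open question); the helper name
is hypothetical, and only standard os.stat fields are used.

```python
import os


def device_and_inode(path: str) -> dict:
    """Return the OS-level identity of a file: device numbers and inode.

    Hypothetical helper for illustration; it only reports what stat(2) says.
    """
    st = os.stat(path)
    return {
        "major": os.major(st.st_dev),  # device major number
        "minor": os.minor(st.st_dev),  # device minor number
        "inode": st.st_ino,
    }
```

Running this on the same path before and after a node is rebuilt would show
whether the device numbers actually change.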

>     Is there a simple way that we could locally test what you're seeing,
>     without having to crash a Grid Engine compute node :-).

Not sure.  I'm not 100% sure what is happening.  I've even seen a recently
taken checkpoint fail (and all earlier ones fail too).

> 3) "For java jar files, it appears that every checkpoint makes another copy of
>      an open jar file -- even when (as far as I know) such files are read
>      only."
>
>     The default policy of DMTCP should be that if the file is read-only,
>     and even if it is writeable but the offset is at the end of the file,
>     then DMTCP should _not_ be making a copy of the file.  The flag
>      --checkpoint-open-files for dmtcp_launch is intended to force DMTCP
>     to make copies of open files in order to overcome that default behavior.
>
>     If you're seeing something different, could you confirm that?  In that
>     case, I'll check again locally, to verify this bug.  Thanks.

It's not different, just suboptimal I think.  The files are almost certainly
opened only for reading (though I may be able to check that).  The position
in the files may skip around -- jar files are probed repeatedly to load different
class files and thus are accessed in more of a random access way than a strictly
sequential access way.  They may also be mapped into memory rather than accessed
by read/write calls.
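
One way to check the open mode and offset on Linux is to read
/proc/<pid>/fdinfo/<fd>, which reports "pos" (the current offset) and "flags"
(the open flags, in octal).  The sketch below assumes a Linux /proc
filesystem; the function names are hypothetical.

```python
import os


def parse_fdinfo(text: str) -> dict:
    """Extract 'pos' and 'flags' from the contents of /proc/<pid>/fdinfo/<fd>."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "pos":
            info["pos"] = int(value)       # current file offset
        elif key == "flags":
            info["flags"] = int(value, 8)  # the kernel prints flags in octal
    return info


def describe_fd(pid: int, fd: int) -> dict:
    """Report path, offset, and read-only status for one open descriptor."""
    with open(f"/proc/{pid}/fdinfo/{fd}") as f:
        info = parse_fdinfo(f.read())
    info["read_only"] = (info["flags"] & os.O_ACCMODE) == os.O_RDONLY
    info["path"] = os.readlink(f"/proc/{pid}/fd/{fd}")
    return info
```

Calling describe_fd on the JVM's pid for each descriptor that points at a jar
file would confirm whether those files are open read-only and where the
offset sits.  Jar files that are mapped into memory show up in
/proc/<pid>/maps instead.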

Thanks for the other info ...

Regards -- Eliot

