On 5/12/2015 4:49 PM, Gene Cooperman wrote:

> 1) We still owe you on the restart issue of 2/26/15.  It's surprisingly
>    subtle.

Thanks for keeping at it.

> 2) "Am I right in guessing that a file on a remote drive is identified
>     in part by the uuid of the local mount point, and in part by information
>     about the remote drive?"
>
>     I believe that DMTCP knows nothing about local or remote drives or
>     mount points.  It should view files as part of a single unified
>     filesystem.

Ok, so here's a model of what might be happening.  DMTCP notes the _device_ a
file comes from.  When a system is rebuilt, the device looks different.  All we
need is one open file on that device to mess up the process.
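
A minimal sketch of that model, for illustration only.  It assumes (this is
unconfirmed, and is not DMTCP code) that the identity saved at checkpoint time
includes the device number the kernel reports for the file.  The helper names
file_identity and same_file are hypothetical.

```python
import os
import tempfile


def file_identity(path):
    """Return the (device, inode) pair the kernel reports for path."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def same_file(saved_identity, path):
    """True if path still resolves to the same device and inode as before."""
    return file_identity(path) == saved_identity


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        saved = file_identity(f.name)  # what a checkpoint might record
        dev, ino = saved
        print("device", os.major(dev), os.minor(dev), "inode", ino)
        # On an unchanged system the comparison succeeds.  If a rebuild
        # changed the device number, the same path would compare unequal.
        print("unchanged:", same_file(saved, f.name))
```

Running this against a file on the remote drive before and after a node rebuild
would show whether the device number really changes.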

>     Is there a simple way that we could locally test what you're seeing,
>     without having to crash a Grid Engine compute node :-).

Not sure.  I'm not 100% sure what is happening.  I've even seen cases where a
recently taken checkpoint fails to restart (and all earlier ones fail too).

> 3) "For java jar files, it appears that every checkpoint makes another copy of
>      an open jar file -- even when (as far as I know) such files are read
>      only."
>
>     The default policy of DMTCP should be that if the file is read-only,
>     and even if it is writeable but the offset is at the end of the file,
>     then DMTCP should _not_ be making a copy of the file.  The flag
>      --checkpoint-open-files for dmtcp_launch is intended to force DMTCP
>     to make copies of open files in order to overcome that default behavior.
>
>     If you're seeing something different, could you confirm that?  In that
>     case, I'll check again locally, to verify this bug.  Thanks.

It's not different, just suboptimal I think.  The files are almost certainly
opened only for reading (though I may be able to check that).  The position
in the files may skip around -- jar files are probed repeatedly to load
different class files, so access is closer to random than strictly sequential.
They may also be mapped into memory rather than accessed by read/write calls.
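
One way to check the open mode and current offset on Linux is to read
/proc/<pid>/fdinfo/<fd> for the running JVM.  The sketch below is illustrative
only: the function names parse_fdinfo and open_jars are hypothetical, and it
assumes the usual fdinfo layout with "pos:" and octal "flags:" fields.

```python
import os

ACCESS_NAMES = {
    os.O_RDONLY: "read-only",
    os.O_WRONLY: "write-only",
    os.O_RDWR: "read-write",
}


def parse_fdinfo(text):
    """Extract (offset, access mode) from the text of /proc/<pid>/fdinfo/<fd>."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    flags = int(fields["flags"], 8)  # the kernel prints the flags in octal
    pos = int(fields["pos"])
    return pos, ACCESS_NAMES.get(flags & os.O_ACCMODE, "unknown")


def open_jars(pid):
    """Yield (path, offset, mode) for each .jar file the process has open."""
    fd_dir = f"/proc/{pid}/fd"
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
            if not target.endswith(".jar"):
                continue
            with open(f"/proc/{pid}/fdinfo/{fd}") as info:
                pos, mode = parse_fdinfo(info.read())
        except OSError:
            continue  # descriptor went away while we were looking
        yield target, pos, mode


if __name__ == "__main__":
    # Replace os.getpid() with the pid of the JVM to inspect.
    for path, pos, mode in open_jars(os.getpid()):
        print(path, pos, mode)
```

Jar files that are only memory-mapped would not appear as open descriptors;
those would show up in /proc/<pid>/maps instead.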

Thanks for the other info ...

Regards -- Eliot

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
