Hi Eliot,
    Good to hear from you again.  Sorry there was a delay before
we answered your bug report.

Hi Rohan and Jiajun,
    I see what the bug is.  Could one of you implement the bug fix
(see below)?

    I was able to reproduce the bug by checkpointing java1 from the
test suite on dekaksi:
  env CLASSPATH=./test ./bin/dmtcp_launch --checkpoint-open-files -i7 java -Xmx5M java1

I then copy ckpt_java_* to CCIS Linux.  The copy is recursive ('scp -r')
because some open files were checkpointed along with the image.

I then restart on CCIS Linux:
  bin/dmtcp_restart ckpt_java_1d4a852a5f139a6-40000-54f8acae.dmtcp
  [27628] mtcp_restart.c:1321 open_shared_file:
  unable to create file /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/pulse-java.jar: 2
Segmentation fault (core dumped)

I then look at the checkpoint image:
  gzip -dc ckpt_java1*.dmtcp | util/readdmtcp.sh tmp.dmtcp 2>&1 | grep -- '-s'

Sure enough, Java is mapping files under /usr/lib/jvm/... as shared files.
On restart, mtcp/mtcp_restart.c tries to re-create each shared mapping
from the _same_ underlying file.  But on the new host, the full pathname
of the underlying shared file has changed, so the file is not found.

Presumably, Java creates the shared mapping so that multiple running
JVMs can share the memory-mapped file.

I assume the solution is this: if the underlying file of a shared
memory region doesn't exist on the new target machine, then we should
simply map the region as shared, but with no underlying file,
using MAP_SHARED|MAP_ANONYMOUS in mmap.

The necessary logic should be self-contained inside mtcp/mtcp_restart.c.
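
To make the idea concrete, here is a minimal sketch of the fallback.  It is
not the actual code in mtcp/mtcp_restart.c; the helper name
map_shared_or_anon and its parameters are hypothetical and chosen only for
illustration.  It tries the file-backed shared mapping first and falls back
to an anonymous shared mapping only when open() fails with ENOENT.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of DMTCP): map a shared region backed by
 * `path` at file offset `offset`.  If the file does not exist on this host,
 * fall back to an anonymous shared mapping and set *used_anon to 1.
 * Returns MAP_FAILED (with errno set) on any other error. */
static void *map_shared_or_anon(void *addr, size_t len, int prot,
                                const char *path, off_t offset,
                                int *used_anon)
{
  assert(len > 0);
  *used_anon = 0;

  /* Restart normally wants the original address, hence MAP_FIXED. */
  int flags = MAP_SHARED | (addr != NULL ? MAP_FIXED : 0);

  int fd = open(path, (prot & PROT_WRITE) ? O_RDWR : O_RDONLY);
  if (fd >= 0) {
    void *p = mmap(addr, len, prot, flags, fd, offset);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return p;
  }
  if (errno != ENOENT) {
    return MAP_FAILED;  /* e.g. EACCES: report it, do not hide it */
  }

  /* File is missing on the new host: no underlying file, zero-filled. */
  *used_anon = 1;
  return mmap(addr, len, prot, flags | MAP_ANONYMOUS, -1, 0);
}
```

Note that the anonymous region starts out zero-filled.  I am assuming the
caller would then have to fill it from the checkpoint image, and that a
region without PROT_WRITE would need to be mapped writable first and
protected afterward with mprotect.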

Jiajun or Rohan,
    Could one of you implement this fix (and also add this new issue
to github)?

Thanks,
- Gene


On Wed, Mar 04, 2015 at 03:53:30PM -0500, Kapil Arya wrote:
> Rohan,Jiajun,
> 
> Could one of you take a quick look at it?
> 
> Kapil
> 
> On Sat, Feb 28, 2015 at 12:04 PM, Eliot Moss <m...@cs.umass.edu> wrote:
> 
> > On 2/26/2015 7:19 PM, Eliot Moss wrote:
> >
> > > gunzip -c foo.gz | java blah blah 2> blah.err | gzip > bar.gz
> > >
> > > 1) Typically fails in restart if restarted on a host different from that
> > >      used for first part of the run.  The complaint is about Unix shared-memory
> > >      stuff in the Java process.
> > >
> > >      Workaround: Restart only on the original host.
> >
> > Here's what happens when restarted on a different host:
> >
> > [42000] ERROR at sysvipc.cpp:775 in postRestart; REASON='JASSERT(_realId != -1) failed'
> >       (strerror((*__errno_location ()))) = No such file or directory
> > java (42000): Terminating...
> >
> > As for the other problems (relative versus absolute path for stderr of a Java
> > process), either I had confounded it with the above or it does not happen every
> > time, so I may have been wrong about it, and in any case do not currently have
> > failure output for it.
> >
> > Regards -- EM
> >
> >
> > ------------------------------------------------------------------------------
> > Dive into the World of Parallel Programming The Go Parallel Website,
> > sponsored
> > by Intel and developed in partnership with Slashdot Media, is your hub for
> > all
> > things parallel software development, from weekly thought leadership blogs
> > to
> > news, videos, case studies, tutorials and more. Take a look and join the
> > conversation now. http://goparallel.sourceforge.net/
> > _______________________________________________
> > Dmtcp-forum mailing list
> > Dmtcp-forum@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >
