On 3/5/2015 2:45 PM, Gene Cooperman wrote:
> Hi Eliot,
>     Good to hear from you again.  Sorry there was a delay before
> we answered your bug report.
>
> Hi Rohan and Jiajun,
>     I see what the bug is.  Could one of you implement the bug fix
> (see below)?

So here's a wondering. I am not sure the file will have a different
*name* on different hosts. The naming scheme through the file system
should be the same. However, on different hosts the file might be
mapped to different locations when linked, and that could be
problematic, no? I am not even sure how Java could be made to adjust
to that. I think you'd have to request mapping to the same address.

The files that I think are in question are on NFS mounts, and the
mount information indicated the remote system IP address AND the
local client IP address. Maybe that somehow is viewed as part of
the name of the files?

Thanks very much for investigating! When you have a fix I think I
can probably test it fairly easily.

Regards -- Eliot Moss

> I was able to reproduce the bug by checkpointing java1 from the
> test suite on dekaksi:
>   env CLASSPATH=./test ./bin/dmtcp_launch --checkpoint-open-files -i7 java -Xmx5M java1
>
> I then recursively copy ('scp -r') ckpt_java_* to CCIS Linux (since there
> are some open files).
>
> I then restart on CCIS Linux:
>   bin/dmtcp_restart ckpt_java_1d4a852a5f139a6-40000-54f8acae.dmtcp
>   [27628] mtcp_restart.c:1321 open_shared_file:
>     unable to create file
>     /usr/lib/jvm/java-7-openjdk-common/jre/lib/ext/pulse-java.jar: 2
>   Segmentation fault (core dumped)
>
> I then look at the checkpoint image:
>   gzip -dc ckpt_java1*.dmtcp | util/readdmtcp.sh tmp.dmtcp 2>&1 | grep -- '-s'
>
> Sure enough, Java is opening files at /usr/lib/jvm/... as shared files.
> We try to restore it by re-creating the shared image in mtcp/mtcp_restart.c
> with the _same_ underlying file.  But on the new host, the full pathname
> of the underlying shared file has changed.
>
> Presumably, Java creates the shared image so that the JVM can
> share the memory-mapped file among multiple running JVMs.
>
> I assume that the solution is that if the underlying filename of a shared
> memory image doesn't exist on the new target machine, then we should
> simply open the file as shared, but with no underlying file,
> using MAP_ANONYMOUS in mmap.
>
> The necessary logic should be self-contained inside mtcp/mtcp_restart.c.
>
> Jiajun or Rohan,
>     Could one of you implement this fix (and also add this new issue
> to github)?
>
> Thanks,
> - Gene
>
>
> On Wed, Mar 04, 2015 at 03:53:30PM -0500, Kapil Arya wrote:
>> Rohan,Jiajun,
>>
>> Could one of you take a quick look at it?
>>
>> Kapil
>>
>> On Sat, Feb 28, 2015 at 12:04 PM, Eliot Moss <m...@cs.umass.edu> wrote:
>>
>>> On 2/26/2015 7:19 PM, Eliot Moss wrote:
>>>
>>>> gunzip -c foo.gz | java blah blah 2> blah.err | gzip > bar.gz
>>>>
>>>> 1) Typically fails in restart if restarted on a host different from that
>>>> used for first part of the run. The complaint is about Unix shared-memory
>>>> stuff in the Java process.
>>>>
>>>> Workaround: Restart only on the original host.
>>>
>>> Here's what happens when restarted on a different host:
>>>
>>> [42000] ERROR at sysvipc.cpp:775 in postRestart; REASON='JASSERT(_realId != -1) failed'
>>>      (strerror((*__errno_location ()))) = No such file or directory
>>> java (42000): Terminating...
>>>
>>> As for the other problems (relative versus absolute path for stderr of a Java
>>> process), either I had confounded it with the above or it does not happen every
>>> time, so I may have been wrong about it, and in any case do not currently have
>>> failure output for it.
>>>
>>> Regards -- EM

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
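
For reference, the fallback Gene proposes in the quoted message (use MAP_ANONYMOUS when the backing file of a shared mapping is missing on the restart host) might look roughly like the sketch below. This is not the actual mtcp/mtcp_restart.c code: the helper name restore_shared_region, its parameters, and the used_fallback flag are all hypothetical, and the sketch assumes a missing file shows up as ENOENT from open().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of DMTCP).
 *
 * Map `len` bytes of `path` as a shared region.  If the file does not
 * exist on this host, fall back to a shared anonymous mapping of the
 * same size.  `extra_flags` is OR-ed into the mmap flags (for example
 * MAP_FIXED, or 0).  `*used_fallback` is set to 1 if the anonymous
 * fallback was taken, 0 otherwise.  Returns MAP_FAILED on error.
 */
static void *restore_shared_region(const char *path, void *addr, size_t len,
                                   int prot, int extra_flags, off_t offset,
                                   int *used_fallback)
{
    assert(path != NULL && len > 0 && used_fallback != NULL);
    *used_fallback = 0;

    int fd = open(path, (prot & PROT_WRITE) ? O_RDWR : O_RDONLY);
    if (fd >= 0) {
        /* Normal case: the backing file exists, map it as before. */
        void *p = mmap(addr, len, prot, extra_flags | MAP_SHARED, fd, offset);
        int saved_errno = errno;
        close(fd);               /* the mapping stays valid after close */
        errno = saved_errno;
        return p;
    }

    if (errno != ENOENT)
        return MAP_FAILED;       /* some other failure: do not hide it */

    /* Backing file is missing on this host: shared, but with no file. */
    *used_fallback = 1;
    return mmap(addr, len, prot,
                extra_flags | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
}
```

With extra_flags set to MAP_FIXED and addr set to the saved address, the region would come back at its original location, which would also cover the concern about the file landing at a different address on a different host. The anonymous region starts out zero-filled, so the restart code would still have to copy the contents in from the checkpoint image, assuming they were saved there.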