Hi Wijnand,

I have been trying to reproduce this issue on my setup. I tried DMTCP-2.3.1 and the latest DMTCP from the GitHub trunk (github.com/dmtcp/dmtcp). Most of the `make check` tests pass in both cases, but some tests fail. As you said, it could be a kernel issue; I have started looking into this. Here are the details:

$ lsb_release -a
LSB Version:    n/a
Distributor ID: openSUSE project
Description:    openSUSE 13.2 (Harlequin) (x86_64)
Release:        13.2
Codename:       Harlequin

$ uname -a
Linux linux.site 3.16.7-7-desktop #1 SMP PREEMPT Wed Dec 17 18:00:44 UTC 2014 (762f27a) x86_64 x86_64 x86_64 GNU/Linux

In the meantime, could you please try the DMTCP trunk from Github and check if the issue persists? Also, try running the tests using `DMTCP_GZIP=0 make check`. I’ve noticed some kernel lockups when using gzip.

Thanks,
Rohan

> On Feb 23, 2015, at 5:47 AM, Wijnand SUIJLEN <wijnand.suij...@huawei.com> wrote:
>
> Hi Rohan,
>
> Thanks for looking into this. I did what you wrote: Downloaded the source
> tarball and did a "./configure && make && make check-dmtcp1". Indeed, the
> test fails. For completeness sake, I attached to this e-mail (build.log) a
> log of a complete "./configure && make && make check". All tests are failing.
>
> The other attachment (testdir-contents) is a directory listing of the
> location where the test scripts copies the checkpoints to.
> To prove that the checkpoints files are not all-zeros, below are the first
> 10 lines of a hexdump of the dmtcp1 checkpoint
>
> wijnand@linux-6ea7:~/bla> hexdump -C /tmp/dmtcp-wijn...@linux-6ea7.site/dmtcp-autotest-845315162/ckpt_dmtcp1_1b03894b1c5e0f7-40000-54e6847e.dmtcp | head -n 10
> 00000000  1f 8b 08 00 7e 84 e6 54  04 03 ec dd 7d 6c 5d 67  |....~..T....}l]g|
> 00000010  7d 07 f0 63 a7 49 4c db  39 e6 65 9d 29 2f 75 a1  |}..c.IL.9.e.)/u.|
> 00000020  65 06 c9 76 ec bc 60 82  18 76 9b b4 37 5b 80 ac  |e..v..`..v..7[..|
> 00000030  24 23 ac 4b 6d 27 76 70  da c4 31 f5 0d 35 94 d1  |$#.Km'vp..1..5..|
> 00000040  8c b4 2c 16 2b 64 13 db  bc a9 62 a1 2a 34 9b d0  |..,.+d....b.*4..|
> 00000050  94 49 fb 23 e2 0f 9a ae  94 04 6d 62 65 05 2d 7b  |.I.#......mbe.-{|
> 00000060  11 ca 50 db 39 65 48 45  bc 79 08 c8 ce 73 ce f3  |..P.9eHE.y...s..|
> 00000070  a4 f6 95 9d bb 21 9a 32  fa 39 d5 bd df f3 7b 9e  |.....!.2.9....{.|
> 00000080  e7 bc 7d ee 71 ee cb b9  76 d7 bf 6d cb f5 9b 07  |..}.q...v..m....|
> 00000090  ae af 6c b8 fe 37 36 bf  63 e3 db b7 0c 6c 7c 5b  |..l..76.c....l|[|
>
>
> I can imagine that the error occurs because of a recent change in the Linux
> kernel. Therefore, here is the output of uname on my system:
> wijnand@linux-6ea7:~/bla> uname -a
> Linux linux-6ea7.site 3.16.7-7-desktop #1 SMP PREEMPT Wed Dec 17 18:00:44 UTC 2014 (762f27a) x86_64 x86_64 x86_64 GNU/Linux
>
> Kind regards,
> Wijnand Suijlen
>
>
> -----Original Message-----
> From: Rohan Garg [mailto:rohg...@ccs.neu.edu]
> Sent: Friday, February 20, 2015 5:57 PM
> To: Wijnand SUIJLEN
> Cc: dmtcp-forum
> Subject: Re: [Dmtcp-forum] dmtcp_restart (dmtcp 2.3.1) won't restart from checkpoint on OpenSUSE 13.2
>
> Hi Wijnand,
>
> Could you please try the following steps?
>
> 1) Download the source tarball from:
>    http://sourceforge.net/projects/dmtcp/files/dmtcp-2.x/2.3.1/
> 2) ./configure && make && make check-dmtcp1
> 3) Verify that the test passes
>
> Also, could you please verify that the checkpoint image is of non-zero size?
> It could be that dmtcp_launch is failing to create a valid checkpoint image.
>
> Thanks,
> Rohan
>
> ----- Original Message -----
> From: "Kapil Arya" <kapil.arya...@gmail.com>
> To: "Wijnand SUIJLEN" <wijnand.suij...@huawei.com>
> Cc: "dmtcp-forum" <dmtcp-forum@lists.sourceforge.net>
> Sent: Friday, February 20, 2015 11:04:04 AM
> Subject: Re: [Dmtcp-forum] dmtcp_restart (dmtcp 2.3.1) won't restart from checkpoint on OpenSUSE 13.2
>
> Rohan/Jiajun,
>
> Can you take a look at it?
>
> Best,
> Kapil
>
> On Thu, Feb 19, 2015 at 11:40 AM, Wijnand SUIJLEN <wijnand.suij...@huawei.com> wrote:
>
>
> Hi,
>
> I am running OpenSUSE 13.2 (64-bit) inside VirtualBox and I am trying to get
> DMTCP 2.3.1 to work on the simple example 'dmtcp1.c' as supplied in the
> source tar.gz distribution of the package (dmtcp-2.3.1/test/dmtcp1.c). There
> are no complaints from DMTCP when writing the checkpoint. However, the
> dmtcp_restart fails to restart the program with the error message "only read
> 0 bytes instead of 4096 from checkpoint file".
>
> What am I doing wrong and what should I do to make it work?
>
> Some details:
> I have tried it with an installation as built directly from the source
> distribution and I tried it with the binary distribution as supplied by
> OpenSUSE 13.2: It doesn't make any difference.
> http://download.opensuse.org/distribution/13.2/repo/oss/suse/x86_64/dmtcp-2.3.1-2.2.2.x86_64.rpm
> http://download.opensuse.org/distribution/13.2/repo/oss/suse/x86_64/dmtcp-devel-2.3.1-2.2.2.x86_64.rpm
>
>
> --- start console log ---
> wijnand@linux-6ea7:~/bla/dmtcp-2.3.1/test> gcc -fPIC -o dmtcp1 dmtcp1.c
> wijnand@linux-6ea7:~/bla/dmtcp-2.3.1/test> dmtcp_checkpoint ./dmtcp1
> 1 2 3 4 5 6 7 8 9 ^C
> wijnand@linux-6ea7:~/bla/dmtcp-2.3.1/test> dmtcp_restart ckpt_dmtcp1_1b03894b1c5e0f7-40000-54e602b1.dmtcp
> [27865] mtcp_util.ic:235 mtcp_readfile:
>   only read 0 bytes instead of 4096 from checkpoint file
> [27865] mtcp_util.ic:235 mtcp_readfile:
>   only read 0 bytes instead of 4096 from checkpoint file
> [27865] mtcp_util.ic:235 mtcp_readfile:
>   only read 0 bytes instead of 4096 from checkpoint file
> [27865] mtcp_util.ic:235 mtcp_readfile:
>   only read 0 bytes instead of 4096 from checkpoint file
> [27865] mtcp_util.ic:235 mtcp_readfile:
>   only read 0 bytes instead of 4096 from checkpoint file
> [27865] mtcp_util.ic:235 mtcp_readfile:
>   only read 0 bytes instead of 4096 from checkpoint file
> [27865] mtcp_util.ic:235 mtcp_readfile:
>   only read 0 bytes instead of 4096 from checkpoint file
> [27865] mtcp_util.ic:235 mtcp_readfile:
>   only read 0 bytes instead of 4096 from checkpoint file
> [27865] mtcp_util.ic:235 mtcp_readfile:
>   only read 0 bytes instead of 4096 from checkpoint file
> [27865] mtcp_util.ic:235 mtcp_readfile:
>   only read 0 bytes instead of 4096 from checkpoint file
> [27865] mtcp_util.ic:235 mtcp_readfile:
>   only read 0 bytes instead of 4096 from checkpoint file
> [27865] mtcp_util.ic:237 mtcp_readfile:
>   failed to read after 10 tries in a row.
> Segmentation fault
> wijnand@linux-6ea7:~/bla/dmtcp-2.3.1/test>
> --- end console log ---
>
> --- start dmtcp_coordinator log ---
> wijnand@linux-6ea7:~/bla/dmtcp-2.3.1/test> dmtcp_coordinator
> dmtcp_coordinator (DMTCP) 2.3.1
> License LGPLv3+: GNU LGPL version 3 or later
>   <http://gnu.org/licenses/lgpl.html>.
> This program comes with ABSOLUTELY NO WARRANTY.
> This is free software, and you are welcome to redistribute it under certain
> conditions; see COPYING file for details.
> (Use flag "-q" to hide this message.)
>
> dmtcp_coordinator starting...
>     Host: linux-6ea7.site (0.0.0.0)
>     Port: 7779
>     Checkpoint Interval: disabled (checkpoint manually instead)
>     Exit on last client: 0
> Type '?' for help.
>
> [27848] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
>      hello_remote.from = 1b03894b1c5e0f7-27851-54e602b1
> [27848] NOTE at dmtcp_coordinator.cpp:825 in onData; REASON='Updating process Information after exec()'
>      progname = dmtcp1
>      msg.from = 1b03894b1c5e0f7-40000-54e602b1
>      client->identity() = 1b03894b1c5e0f7-27851-54e602b1
> c
> [27848] NOTE at dmtcp_coordinator.cpp:1271 in startCheckpoint; REASON='starting checkpoint, suspending all nodes'
>      s.numPeers = 1
> [27848] NOTE at dmtcp_coordinator.cpp:1273 in startCheckpoint; REASON='Incremented Generation'
>      compId.generation() = 1
> [27848] NOTE at dmtcp_coordinator.cpp:615 in updateMinimumState; REASON='locking all nodes'
> [27848] NOTE at dmtcp_coordinator.cpp:621 in updateMinimumState; REASON='draining all nodes'
> [27848] NOTE at dmtcp_coordinator.cpp:627 in updateMinimumState; REASON='checkpointing all nodes'
> [27848] NOTE at dmtcp_coordinator.cpp:641 in updateMinimumState; REASON='building name service database'
> [27848] NOTE at dmtcp_coordinator.cpp:657 in updateMinimumState; REASON='entertaining queries now'
> [27848] NOTE at dmtcp_coordinator.cpp:662 in updateMinimumState; REASON='refilling all nodes'
> [27848] NOTE at dmtcp_coordinator.cpp:693 in updateMinimumState; REASON='restarting all nodes'
> [27848] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
>      client->identity() = 1b03894b1c5e0f7-40000-54e602b1
> [27848] NOTE at dmtcp_coordinator.cpp:1096 in validateRestartingWorkerProcess; REASON='FIRST dmtcp_restart connection. Set numPeers. Generate timestamp'
>      numPeers = 1
>      curTimeStamp = 22789762212
>      compId = 1b03894b1c5e0f7-40000-54e602b1
> [27848] NOTE at dmtcp_coordinator.cpp:1040 in onConnect; REASON='worker connected'
>      hello_remote.from = 1b03894b1c5e0f7-40000-54e602b1
> [27848] NOTE at dmtcp_coordinator.cpp:875 in onDisconnect; REASON='client disconnected'
>      client->identity() = 1b03894b1c5e0f7-40000-54e602b1
> --- end dmtcp_coordinator log ---
>
>
> Kind regards,
> Wijnand Suijlen
>
> ------------------------------------------------------------------------------
> Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
> from Actuate! Instantly Supercharge Your Business Reports and Dashboards
> with Interactivity, Sharing, Native Excel Exports, App Integration & more
> Get technology previously reserved for billion-dollar corporations, FREE
> http://pubads.g.doubleclick.net/gampad/clk?id=190641631&iu=/4140/ostg.clktrk
> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>
> <build.log><testdir-contents>
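
A side note on the hexdump quoted above: its first bytes are 1f 8b 08, which is the gzip magic number followed by the deflate method byte, so that image appears to be non-empty and gzip-compressed. The check can be sketched in shell as below. This is only a rough sketch: `ckpt_kind` is a hypothetical helper name, not a DMTCP command, and it looks at the first two bytes of the file without validating the rest of the image.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP): classify a checkpoint image
# by its size and its first two bytes.
# Prints "empty-or-missing", "gzip", or "other".
ckpt_kind() {
    file=$1

    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$file" ]; then
        echo "empty-or-missing"
        return 1
    fi

    # Read the first two bytes and render them as lowercase hex
    # with all whitespace removed, e.g. "1f8b".
    magic=$(head -c 2 "$file" | od -An -tx1 | tr -d ' \n')

    # 1f 8b is the gzip magic number.
    if [ "$magic" = "1f8b" ]; then
        echo "gzip"
    else
        echo "other"
    fi
}
```

For an image like the one in the hexdump, `ckpt_kind ckpt_dmtcp1_1b03894b1c5e0f7-40000-54e6847e.dmtcp` would print `gzip`. An image written with `DMTCP_GZIP=0` would presumably print `other`; that is an assumption, since no uncompressed image header is shown in this thread.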