Hi Harry,

It looks like DMTCP has gotten into some weird checkpointing loop. Can you try
unsetting DMTCP_CHECKPOINT_INTERVAL and checkpointing manually from the
dmtcp_coordinator window? That might help us shed some light on what went
wrong.
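
Something along these lines should do it. This is only a rough sketch that
reuses the commands from your report; the one assumption is that `c` is the
key that triggers a checkpoint in the coordinator window (typing `?` there
lists the exact keys for your version).

```shell
#!/bin/sh
# Sketch: turn off periodic checkpointing so checkpoints happen only on demand.
# Run this in every terminal that exported the variable (term1 and term2).
unset DMTCP_CHECKPOINT_INTERVAL

# Confirm the variable is really gone before starting the coordinator.
if [ -z "${DMTCP_CHECKPOINT_INTERVAL+x}" ]; then
    interval_state="unset"
else
    interval_state="set to $DMTCP_CHECKPOINT_INTERVAL"
fi
echo "DMTCP_CHECKPOINT_INTERVAL is $interval_state"

# Next steps, taken from the original report (not executed by this script):
#   term1: dmtcp_coordinator
#   term2: dmtcp_launch tacg -n6 -slLc -S -F2 < chr1.fa > junk/jj &
#   term1: type c and press Enter to request a single checkpoint
#          (assumed key; '?' in the coordinator window shows the help)
#   term2: kill the job, then run ./dmtcp_restart_script.sh as before
```

With the interval unset, the coordinator should only start a checkpoint when
you ask for one, which makes it easier to tell whether the restart itself
hangs or whether the periodic checkpoints are what keep it from making
progress.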

Also, would it be possible for you to share the application binary/code
with us to help us reproduce this issue?

Best,
Kapil

On Thu, May 17, 2018 at 6:36 PM harry mangalam <harry.manga...@uci.edu>
wrote:

> Hi All,
>
>
>
> We're currently using BLCR to c/r jobs on our academic cluster and after
> some work, it works great for the things that it can do, but recently the
> BL lost funding to support it and it will no longer run with new kernels.
>
>
>
> So we're looking for a new c/r infrastructure and a user from NYU rec'ed
> DMTCP. I first tried the version in the Ubuntu repository; it would
> checkpoint OK, but would refuse to restart.
>
>
>
> I then downloaded the latest stable code, which compiled perfectly, as far
> as I could see, and installed it. It passed all the tests but one:
>
> ================================
>
> == Tests ==
> ..
>
>
>
> epoll1         FAILED
>               root-pids: [5791] msg: user program startup error, 2
> expected, 1 found, running=1
> ..
>
>
>
> == Summary ==
> stunted: 58 of 59 tests passed
> ================================
>
>
>
>
>
> The new one had a similar problem in that it would checkpoint, but not
> restart.
>
>
>
> However, this was tested on my laptop and when I moved to my office from
> home (causing a network reset), and tried to restart the job one last time,
> the job DID restart, but lost the redirection to STDOUT, and dumped a lot
> of output and finally segfaulted (which the naked job does not do.) Tried
> it 2x with the same result.
>
>
>
> With the current version (latest stable, 2.5.2), still on my
> laptop
>
> $ uname -a
>
>
> Linux stunted 4.4.0-21-generic #37-Ubuntu SMP Mon Apr 18 18:33:37 UTC 2016
> x86_64 x86_64 x86_64 GNU/Linux
>
> This is the sequence of events:
>
>
>
> In term1:
> $ ps aux | grep -i dctc[p]
>
>
> (nothing)
> $ export DMTCP_CHECKPOINT_INTERVAL=30
> $ export DMTCP_PORT=8889
> $ export DMTCP_GZIP=0
> $ export DMTCP_CHECKPOINT_DIR=/tmp/dmtcp
> $ dmtcp_coordinator
> dmtcp_coordinator starting...
>    Host: stunted.nac.uci.edu (0.0.0.0)
>    Port: 8889
>    Checkpoint Interval: 30
>    Exit on last client: 0
> Type '?' for help.
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> In term2:
>
>
> $ cd
> $ export DMTCP_CHECKPOINT_INTERVAL=30
> $ export DMTCP_PORT=8889
> $ export DMTCP_GZIP=0
> $ export DMTCP_CHECKPOINT_DIR=/tmp/dmtcp
> $ ps aux | grep dmtcp
> hjm        547  0.0  0.0  26376  3932 pts/1    S+   15:07   0:00 dmtcp_coordinator
> $ sleep 1
> $ rm -f /tmp/dmtcp/*
> $ dmtcp_launch tacg -n6 -slLc -S -F2 < chr1.fa > junk/jj &
> [2] 651
> $ tacgid=$!
> $ echo "captured PID = $tacgid"
> captured PID = 651
> $ sleep 5
>
>
>
> (stderr starts...)
>
>
> [40000] NOTE at socketconnlist.cpp:177 in scanForPreExisting;
> REASON='found pre-existing socket... will not be
> restored'
>     fd = 20
>     device = pipe:[31581]
> [40000] WARNING at socketconnection.cpp:224 in TcpConnection;
> REASON='JWARNING((domain == AF_INET || domain ==
> AF_UNIX || domain == AF_INET6) && (type & 077) == SOCK_STREAM) failed'
>     domain = 0
>     type = 0
>     protocol = 0
> [40000] NOTE at socketconnlist.cpp:177 in scanForPreExisting;
> REASON='found pre-existing socket... will not be
> restored'
>     fd = 22
>     device = pipe:[31581]
> [40000] WARNING at socketconnection.cpp:224 in TcpConnection;
> REASON='JWARNING((domain == AF_INET || domain ==
> AF_UNIX || domain == AF_INET6) && (type & 077) == SOCK_STREAM) failed'
>     domain = 0
>     type = 0
>     protocol = 0
> [40000] NOTE at socketconnlist.cpp:177 in scanForPreExisting;
> REASON='found pre-existing socket... will not be
> restored'
>     fd = 23
>     device = pipe:[31582]
> [40000] WARNING at socketconnection.cpp:224 in TcpConnection;
> REASON='JWARNING((domain == AF_INET || domain ==
> AF_UNIX || domain == AF_INET6) && (type & 077) == SOCK_STREAM) failed'
>     domain = 0
>     type = 0
>     protocol = 0
> [40000] NOTE at socketconnlist.cpp:177 in scanForPreExisting;
> REASON='found pre-existing socket... will not be
> restored'
>     fd = 24
>     device = pipe:[31582]
> [40000] WARNING at socketconnection.cpp:224 in TcpConnection;
> REASON='JWARNING((domain == AF_INET || domain ==
> AF_UNIX || domain == AF_INET6) && (type & 077) == SOCK_STREAM) failed'
>     domain = 0
>     type = 0
>     protocol = 0
>
> $ dmtcp_command -c
> $ sleep 5
> $ ls -l /tmp/dmtcp
> total 258316
> -rw------- 1 hjm hjm 264499200 May 17 15:09 ckpt_tacg_21af429c2e5d9c5-40000-653327200ae8.dmtcp
> lrwxrwxrwx 1 hjm hjm        58 May 17 15:09 dmtcp_restart_script.sh -> dmtcp_restart_script_21af429c2e5d9c5-40000-65332637bcc1.sh*
> -rwxrw-r-- 1 hjm hjm     12416 May 17 15:09 dmtcp_restart_script_21af429c2e5d9c5-40000-65332637bcc1.sh*
> $ sleep 2
> $ ls -l /tmp/dmtcp
> total 258316
> -rw------- 1 hjm hjm 264499200 May 17 15:09 ckpt_tacg_21af429c2e5d9c5-40000-653327200ae8.dmtcp
> lrwxrwxrwx 1 hjm hjm        58 May 17 15:09 dmtcp_restart_script.sh -> dmtcp_restart_script_21af429c2e5d9c5-40000-65332637bcc1.sh*
> -rwxrw-r-- 1 hjm hjm     12416 May 17 15:09 dmtcp_restart_script_21af429c2e5d9c5-40000-65332637bcc1.sh*
>
> $ kill -9 $tacgid
> $ sleep 1
> [2]+  Killed                  dmtcp_launch tacg -n6 -slLc -S -F2 < /home/hjm/tacg/hg19/chr1.fa > junk/jj
> $ ps aux | grep tac[g]
> hjm        741  0.0  0.0  11284   928 pts/2    S+   15:09   0:00 grep tacg
> $ cd /tmp/dmtcp
> $ ls -l
> total 258316
> -rw------- 1 hjm hjm 264499200 May 17 15:09 ckpt_tacg_21af429c2e5d9c5-40000-653327200ae8.dmtcp
> lrwxrwxrwx 1 hjm hjm        58 May 17 15:09 dmtcp_restart_script.sh -> dmtcp_restart_script_21af429c2e5d9c5-40000-65332637bcc1.sh*
> -rwxrw-r-- 1 hjm hjm     12416 May 17 15:09 dmtcp_restart_script_21af429c2e5d9c5-40000-65332637bcc1.sh*
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> In term1, the following appeared after the start of the tacg app:
>
>
>
> [547] NOTE at dmtcp_coordinator.cpp:1368 in updateCheckpointInterval;
> REASON='CheckpointInterval updated (for this computation only)'
>     oldInterval = 30
>     theCheckpointInterval = 30
> [547] NOTE at dmtcp_coordinator.cpp:917 in onConnect; REASON='worker
> connected'
>     hello_remote.from = 21af429c2e5d9c5-651-65332637bcc1
> [547] NOTE at dmtcp_coordinator.cpp:667 in onData; REASON='Updating
> process Information after exec()'
>     progname = tacg
>     msg.from = 21af429c2e5d9c5-40000-653327200ae8
>     client->identity() = 21af429c2e5d9c5-651-65332637bcc1
> [547] NOTE at dmtcp_coordinator.cpp:1145 in startCheckpoint;
> REASON='starting checkpoint, suspending all nodes'
>     s.numPeers = 1
> [547] NOTE at dmtcp_coordinator.cpp:1147 in startCheckpoint;
> REASON='Incremented computationGeneration'
>     compId.computationGeneration() = 1
> [547] NOTE at dmtcp_coordinator.cpp:437 in updateMinimumState;
> REASON='locking all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:443 in updateMinimumState;
> REASON='draining all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:449 in updateMinimumState;
> REASON='checkpointing all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:473 in updateMinimumState;
> REASON='building name service database'
> [547] NOTE at dmtcp_coordinator.cpp:489 in updateMinimumState;
> REASON='entertaining queries now'
> [547] NOTE at dmtcp_coordinator.cpp:494 in updateMinimumState;
> REASON='refilling all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:534 in updateMinimumState;
> REASON='restarting all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:717 in onDisconnect; REASON='client
> disconnected'
>     client->identity() = 21af429c2e5d9c5-40000-653327200ae8
>     client->progname() = tacg
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> In term2, I try to restart the checkpointed app:
>
>
>
> $ ./dmtcp_restart_script.sh
>
>
> (and now it just sits there)
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> while in term1: output starts
>
>
>
> [547] NOTE at dmtcp_coordinator.cpp:992 in validateRestartingWorkerProcess;
> REASON='FIRST dmtcp_restart connection.  Set numPeers. Generate timestamp'
>     numPeers = 1
>     curTimeStamp = 111740086278100
>     compId = 21af429c2e5d9c5-40000-65332637bcc1
> [547] NOTE at dmtcp_coordinator.cpp:917 in onConnect; REASON='worker
> connected'
>     hello_remote.from = 21af429c2e5d9c5-40000-653327200ae8
> [547] NOTE at dmtcp_coordinator.cpp:484 in updateMinimumState;
> REASON='building name service database (after restart)'
> [547] NOTE at dmtcp_coordinator.cpp:489 in updateMinimumState;
> REASON='entertaining queries now'
> [547] NOTE at dmtcp_coordinator.cpp:494 in updateMinimumState;
> REASON='refilling all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:534 in updateMinimumState;
> REASON='restarting all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:1145 in startCheckpoint;
> REASON='starting checkpoint, suspending all nodes'
>     s.numPeers = 1
> [547] NOTE at dmtcp_coordinator.cpp:1147 in startCheckpoint;
> REASON='Incremented computationGeneration'
>     compId.computationGeneration() = 2
> [547] NOTE at dmtcp_coordinator.cpp:437 in updateMinimumState;
> REASON='locking all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:443 in updateMinimumState;
> REASON='draining all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:449 in updateMinimumState;
> REASON='checkpointing all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:473 in updateMinimumState;
> REASON='building name service database'
> [547] NOTE at dmtcp_coordinator.cpp:489 in updateMinimumState;
> REASON='entertaining queries now'
> [547] NOTE at dmtcp_coordinator.cpp:494 in updateMinimumState;
> REASON='refilling all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:534 in updateMinimumState;
> REASON='restarting all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:1145 in startCheckpoint;
> REASON='starting checkpoint, suspending all nodes'
>     s.numPeers = 1
> [547] NOTE at dmtcp_coordinator.cpp:1147 in startCheckpoint;
> REASON='Incremented computationGeneration'
>     compId.computationGeneration() = 3
> [547] NOTE at dmtcp_coordinator.cpp:437 in updateMinimumState;
> REASON='locking all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:443 in updateMinimumState;
> REASON='draining all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:449 in updateMinimumState;
> REASON='checkpointing all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:473 in updateMinimumState;
> REASON='building name service database'
> [547] NOTE at dmtcp_coordinator.cpp:489 in updateMinimumState;
> REASON='entertaining queries now'
> [547] NOTE at dmtcp_coordinator.cpp:494 in updateMinimumState;
> REASON='refilling all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:534 in updateMinimumState;
> REASON='restarting all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:1145 in startCheckpoint;
> REASON='starting checkpoint, suspending all nodes'
>     s.numPeers = 1
> [547] NOTE at dmtcp_coordinator.cpp:1147 in startCheckpoint;
> REASON='Incremented computationGeneration'
>     compId.computationGeneration() = 4
> [547] NOTE at dmtcp_coordinator.cpp:437 in updateMinimumState;
> REASON='locking all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:443 in updateMinimumState;
> REASON='draining all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:449 in updateMinimumState;
> REASON='checkpointing all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:473 in updateMinimumState;
> REASON='building name service database'
> [547] NOTE at dmtcp_coordinator.cpp:489 in updateMinimumState;
> REASON='entertaining queries now'
> [547] NOTE at dmtcp_coordinator.cpp:494 in updateMinimumState;
> REASON='refilling all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:534 in updateMinimumState;
> REASON='restarting all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:1145 in startCheckpoint;
> REASON='starting checkpoint, suspending all nodes'
>     s.numPeers = 1
>
> and every 30 seconds (the checkpoint period), I see a new stanza of
>
>
> [547] NOTE at dmtcp_coordinator.cpp:1147 in startCheckpoint;
> REASON='Incremented computationGeneration'
>     compId.computationGeneration() = 12
> [547] NOTE at dmtcp_coordinator.cpp:437 in updateMinimumState;
> REASON='locking all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:443 in updateMinimumState;
> REASON='draining all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:449 in updateMinimumState;
> REASON='checkpointing all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:473 in updateMinimumState;
> REASON='building name service database'
> [547] NOTE at dmtcp_coordinator.cpp:489 in updateMinimumState;
> REASON='entertaining queries now'
> [547] NOTE at dmtcp_coordinator.cpp:494 in updateMinimumState;
> REASON='refilling all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:534 in updateMinimumState;
> REASON='restarting all nodes'
> [547] NOTE at dmtcp_coordinator.cpp:1145 in startCheckpoint;
> REASON='starting checkpoint, suspending all nodes'
>     s.numPeers = 1
>
>
>
> with a blip on 'top' of 'DMTCP:tacg'
>
>
>
> However, this has gone on much longer than the application should have
> gone - it only takes 1min to complete and I've been seeing this message
> cycling for well over 5min.
>
>
>
> and in between those messages, 'top' shows no evidence that my tacg app
> is consuming cycles.
>
>
>
> Any ideas?
>
>
> hjm
>
>
>
>
> --
>
> Harry Mangalam,
>
> Info <http://moo.nac.uci.edu/~hjm/hjm.sig.html>
>
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>