Hi Anirban,

Thanks for writing to us. I have a few questions that will help
diagnose the problem:

1. What is the network type of the cluster? Is it based on Ethernet
   or InfiniBand?

2. Were you running interactive jobs, or batch jobs?

3. The error most likely indicates that some sockets are not under
   the control of DMTCP. I suspect this is because SLURM keeps some
   extra socket connections to the srun process. Have you tried
   switching the order of srun and dmtcp_launch? Something like this:

   srun dmtcp_launch ...
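
   To make the two orderings concrete, here is a dry-run sketch that
   only prints the commands instead of running them. The node count,
   task count, and binary name are taken from your report; carrying
   the --ib and --rm flags over to the swapped ordering is my
   assumption and may need adjusting for your setup. I have also
   written the InfiniBand flag in its long form, --ib. The swapped
   form may additionally need a dmtcp_coordinator that every node can
   reach, which this sketch does not set up.

```shell
#!/bin/sh
# Dry-run sketch: build and print both launch orderings; nothing is executed.
# Values below come from the original report; adjust for your cluster.
APP="cg.E.4"
NODES=4
TASKS=4

# Assumption: the same DMTCP flags are kept in both orderings.
DMTCP_FLAGS="--ib --rm"

# Ordering from the report: dmtcp_launch wraps srun, so srun itself
# (and its sockets to the SLURM daemons) runs under DMTCP.
original_cmd="dmtcp_launch $DMTCP_FLAGS srun -N $NODES -n $TASKS $APP"

# Suggested ordering: srun stays outside DMTCP and starts
# dmtcp_launch once per task.
swapped_cmd="srun -N $NODES -n $TASKS dmtcp_launch $DMTCP_FLAGS $APP"

echo "original: $original_cmd"
echo "swapped:  $swapped_cmd"
```

   If the printed swapped command looks right for your cluster, you
   can run it by hand and then request a checkpoint as before.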


Best,
Jiajun

On Mon, Jul 24, 2017 at 11:28:41PM +0000, Nag, Anirban wrote:
> Hi,
> 
> I am running the NAS CG benchmark using DMTCP and SLURM, so I am executing 
> the following command:
> dmtcp_launch -ib --rm srun -N 4 -n 4 cg.E.4
> 
> And then checkpointing using dmtcp_command -checkpoint
> 
> I am getting the following error:
> 
> [40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval; 
> REASON='JWARNING(false) failed'
>      _dataSockets[i]->socket().sockfd() = 18
>      buffer.size() = 82
>      WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running under 
> DMTCP?
> [40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval; 
> REASON='JWARNING(false) failed'
>      _dataSockets[i]->socket().sockfd() = 18
>      buffer.size() = 82
>      WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running under 
> DMTCP?
> [40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval; 
> REASON='JWARNING(false) failed'
>      _dataSockets[i]->socket().sockfd() = 18
>      buffer.size() = 82
>      WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running under 
> DMTCP?
> [40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval; 
> REASON='JWARNING(false) failed'
>      _dataSockets[i]->socket().sockfd() = 18
>      buffer.size() = 82
>      WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running under 
> DMTCP?
> [40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval; 
> REASON='JWARNING(false) failed'
>      _dataSockets[i]->socket().sockfd() = 18
>      buffer.size() = 82
>      WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running under 
> DMTCP?
> [40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval; 
> REASON='JWARNING(false) failed'
>      _dataSockets[i]->socket().sockfd() = 18
>      buffer.size() = 82
>      WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running under 
> DMTCP?
> [40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval; 
> REASON='JWARNING(false) failed'
>      _dataSockets[i]->socket().sockfd() = 18
>      buffer.size() = 82
>      WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running under 
> DMTCP?
> [40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval; 
> REASON='JWARNING(false) failed'
>      _dataSockets[i]->socket().sockfd() = 18
>      buffer.size() = 82
>      WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running under 
> DMTCP?
> [40000] WARNING at kernelbufferdrainer.cpp:144 in onTimeoutInterval; 
> REASON='JWARNING(false) failed'
>      _dataSockets[i]->socket().sockfd() = 18
>      buffer.size() = 82
>      WARN_INTERVAL_SEC = 10
> Message: Still draining socket... perhaps remote host is not running under 
> DMTCP?
> [40000] WARNING at kernelbufferdrainer.cpp:70 in onConnect; 
> REASON='JWARNING(false) failed'
>      sock.sockfd() = 19
> Message: we don't yet support checkpointing non-accepted connections... 
> restore will likely fail.. closing connection
> [40000] WARNING at kernelbufferdrainer.cpp:70 in onConnect; 
> REASON='JWARNING(false) failed'
>      sock.sockfd() = 19
> Message: we don't yet support checkpointing non-accepted connections... 
> restore will likely fail.. closing connection
> slurmstepd: error: Message length of 2071343164
> exceeds maximum of 1024
> slurmstepd: error: *** STEP 1350.0 ON server_x CANCELLED AT 
> 2017-07-24T12:41:02 DUE TO TIME LIMIT ***
> slurmstepd: error: Failed to send MESSAGE_TASK_EXIT: Transport endpoint is 
> not connected
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: Force Terminated job 1350
> srun: error: Timed out waiting for job step to complete
> 
> I am using the DMTCP 3.0.0, SLURM 16.05.4, openmpi 1.10.4
> 
> -Anirban.

> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot

> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum

