I'm trying to get DMTCP set up for my users. I installed DMTCP from the
2.6.1~rc1 RPM available from the EPEL repository. I'm trying to get it to
work with Slurm 20.11.3 and Open MPI 4.0.3.
Is there a command I can use to see what options it was built or
configured with, so I can tell if this version has IB support?
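In case it helps frame the question: short of a dedicated command, one
rough check I can think of is to list the files the RPM installed and
look for an InfiniBand plugin library. The sketch below assumes the
package is named dmtcp and that such a plugin, if it was built, has
"infiniband" somewhere in its file name. Both are guesses on my part,
and has_ib_plugin is just a helper name I made up.

```shell
#!/bin/bash
# Read a list of file paths on stdin and report whether any of them looks
# like an InfiniBand plugin. The "infiniband" pattern is an assumption
# about how such a library would be named.
has_ib_plugin() {
    if grep -qi 'infiniband'; then
        echo "yes"
    else
        echo "no"
    fi
}

# Only query the RPM database where rpm exists and the (assumed) package
# name "dmtcp" is actually installed.
if command -v rpm >/dev/null 2>&1 && rpm -q dmtcp >/dev/null 2>&1; then
    echo "InfiniBand plugin packaged: $(rpm -ql dmtcp | has_ib_plugin)"
fi
```

A "no" here would only mean that no file matched the pattern, not
necessarily that IB support is absent.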
When I try to run my job, I get these errors:
[40000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[40000] WARNING at signalwrappers.cpp:141 in sigaction;
REASON='JWARNING(false) failed'
"Application trying to use DMTCP's signal for it's own use.\n" "
You should employ a different signal by setting the\n" " environment
variable DMTCP_SIGCKPT to the number\n" " of the signal that DMTCP
should use for checkpointing." = Application trying to use DMTCP's
signal for it's own use.
You should employ a different signal by setting the
environment variable DMTCP_SIGCKPT to the number
of the signal that DMTCP should use for checkpointing.
stopSignal = 10
[40000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[41000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[41000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[41000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection;
REASON='JWARNING(false) failed'
type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short
lived connection!
And
[41000] ERROR at sysvipcwrappers.cpp:169 in shmctl;
REASON='JASSERT(realShmid != -1) failed'
dmtcp_hello (41000): Terminating...
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing
the job to be terminated. The first process to do so was:
Process name: [[12118,1],0]
Exit code: 99
--------------------------------------------------------------------------
[40000] WARNING at signalwrappers.cpp:141 in sigaction;
REASON='JWARNING(false) failed'
"Application trying to use DMTCP's signal for it's own use.\n" "
You should employ a different signal by setting the\n" " environment
variable DMTCP_SIGCKPT to the number\n" " of the signal that DMTCP
should use for checkpointing." = Application trying to use DMTCP's
signal for it's own use.
You should employ a different signal by setting the
environment variable DMTCP_SIGCKPT to the number
of the signal that DMTCP should use for checkpointing.
stopSignal = 10
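Following the warning's own advice, one experiment I could try is to
point DMTCP_SIGCKPT at a different signal than the one reported as
stopSignal. The sketch below only shows the mechanics. The candidate
value 12 is picked for illustration, and I have not checked whether the
application also installs a handler for it. signal_name is a helper
name I made up.

```shell
#!/bin/bash
# Print the name of a signal number as this system's shell reports it.
signal_name() {
    kill -l "$1"
}

# 10 is the value reported as stopSignal in the warning above.
# 12 is only a candidate replacement chosen for illustration, not a
# tested recommendation.
current=10
candidate=12

echo "in use:    $current ($(signal_name "$current"))"
echo "candidate: $candidate ($(signal_name "$candidate"))"

# Export the candidate so that dmtcp_launch started from this shell sees it.
export DMTCP_SIGCKPT=$candidate
```

In the job script this would replace the existing
export DMTCP_SIGCKPT=10 line inside start_coordinator.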
My sbatch submit script looks like this:
#!/bin/bash
#SBATCH -p general
#SBATCH -t 00:01:00
#SBATCH -n 2
#SBATCH --mem=1000
#SBATCH --export=ALL
#SBATCH -J dmctp_hello
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err
#SBATCH --mail-type=ALL
module load gcc
module load openmpi
# 0. Set up DMTCP environment for a job
start_coordinator()
{
    ############################################################
    # For debugging when launching a custom coordinator, uncomment
    # the following lines and provide the proper host and port for
    # the coordinator.
    ############################################################
    # export DMTCP_COORD_HOST=$h
    # export DMTCP_COORD_PORT=$p
    # return
    fname=dmtcp_command.$SLURM_JOBID
    h=$(hostname)
    check_coordinator=$(which dmtcp_coordinator)
    if [ -z "$check_coordinator" ]; then
        echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
        exit 0
    fi
    dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1
    while true; do
        if [ -f "$fname" ]; then
            p=$(cat $fname)
            if [ -n "$p" ]; then
                # try to communicate ? dmtcp_command -p $p l
                break
            fi
        fi
    done
    # Create dmtcp_command wrapper for easy communication with coordinator
    p=$(cat $fname)
    chmod +x $fname
    echo "#!/bin/bash" > $fname
    echo >> $fname
    echo "export PATH=$PATH" >> $fname
    echo "export DMTCP_COORD_HOST=$h" >> $fname
    echo "export DMTCP_COORD_PORT=$p" >> $fname
    echo "dmtcp_command \$@" >> $fname
    # Set up local environment for DMTCP
    export DMTCP_COORD_HOST=$h
    export DMTCP_COORD_PORT=$p
    export DMTCP_SIGCKPT=10
}
# 1. Start DMTCP coordinator
start_coordinator -i 10
# 2. Launch application
#srun /usr/bin/dmtcp_launch --rm ./dmctp_hello
dmtcp_launch --rm mpirun /u/pbisbal/testing/dmtcp/dmtcp_hello
This sbatch script is essentially the same as the one here:
https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_launch.job
I just removed some comments to make it smaller and easier to read.
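One variation I am considering for the script: the loop that waits for
the coordinator's port file spins without sleeping and never gives up.
A bounded wait might look like the sketch below. wait_for_port_file is
a name I made up, and the 30-second default is arbitrary.

```shell
#!/bin/bash
# Wait until FILE exists and is non-empty, then print its contents.
# Gives up after roughly TIMEOUT seconds (default 30) and returns 1.
wait_for_port_file() {
    local file=$1
    local timeout=${2:-30}
    local waited=0
    while [ ! -s "$file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $file" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    cat "$file"
}
```

Inside start_coordinator, the while loop could then be replaced with
something like p=$(wait_for_port_file "$fname") || exit 1.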
--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov