I'm trying to get DMTCP set up for my users. I installed DMTCP from the 2.6.1~rc1 RPM available from the EPEL repository, and I'm trying to get it to work with Slurm 20.11.3 and Open MPI 4.0.3.

Is there a command I can use to see what options it was built or configured with, so I can tell if this version has IB support?

When I try to run my job, I get these errors:

[40000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[40000] WARNING at signalwrappers.cpp:141 in sigaction; REASON='JWARNING(false) failed'
     "Application trying to use DMTCP's signal for it's own use.\n" "  You should employ a different signal by setting the\n" "  environment variable DMTCP_SIGCKPT to the number\n" "  of the signal that DMTCP should use for checkpointing." = Application trying to use DMTCP's signal for it's own use.
  You should employ a different signal by setting the
  environment variable DMTCP_SIGCKPT to the number
  of the signal that DMTCP should use for checkpointing.
     stopSignal = 10
[40000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[41000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[41000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[41000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2

And

Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[41000] ERROR at sysvipcwrappers.cpp:169 in shmctl; REASON='JASSERT(realShmid != -1) failed'
dmtcp_hello (41000): Terminating...
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[12118,1],0]
  Exit code:    99
--------------------------------------------------------------------------
[40000] WARNING at signalwrappers.cpp:141 in sigaction; REASON='JWARNING(false) failed'
     "Application trying to use DMTCP's signal for it's own use.\n" "  You should employ a different signal by setting the\n" "  environment variable DMTCP_SIGCKPT to the number\n" "  of the signal that DMTCP should use for checkpointing." = Application trying to use DMTCP's signal for it's own use.
  You should employ a different signal by setting the
  environment variable DMTCP_SIGCKPT to the number
  of the signal that DMTCP should use for checkpointing.
     stopSignal = 10

My sbatch submit script looks like this:

#!/bin/bash

#SBATCH -p general
#SBATCH -t 00:01:00
#SBATCH -n 2
#SBATCH --mem=1000
#SBATCH --export=ALL
#SBATCH -J dmctp_hello
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err
#SBATCH --mail-type=ALL

module load gcc
module load openmpi

# 0. Set up DMTCP environment for a job

start_coordinator()
{
    ############################################################
    # For debugging when launching a custom coordinator, uncomment
    # the following lines and provide the proper host and port for
    # the coordinator.
    ############################################################
    # export DMTCP_COORD_HOST=$h
    # export DMTCP_COORD_PORT=$p
    # return

    fname=dmtcp_command.$SLURM_JOBID
    h=$(hostname)

    check_coordinator=$(which dmtcp_coordinator)
    if [ -z "$check_coordinator" ]; then
        echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
        exit 0
    fi

    dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1

    while true; do
        if [ -f "$fname" ]; then
            p=$(cat $fname)
            if [ -n "$p" ]; then
                # try to communicate ? dmtcp_command -p $p l
                break
            fi
        fi
    done

    # Create dmtcp_command wrapper for easy communication with coordinator
    p=$(cat $fname)
    chmod +x $fname
    echo "#!/bin/bash" > $fname
    echo >> $fname
    echo "export PATH=$PATH" >> $fname
    echo "export DMTCP_COORD_HOST=$h" >> $fname
    echo "export DMTCP_COORD_PORT=$p" >> $fname
    echo "dmtcp_command \$@" >> $fname

    # Set up local environment for DMTCP
    export DMTCP_COORD_HOST=$h
    export DMTCP_COORD_PORT=$p
    export DMTCP_SIGCKPT=10

}

# 1. Start DMTCP coordinator
start_coordinator -i 10


# 2. Launch application
#srun /usr/bin/dmtcp_launch --rm ./dmctp_hello
dmtcp_launch --rm mpirun /u/pbisbal/testing/dmtcp/dmtcp_hello

This sbatch script is essentially the same as the one here:

https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_launch.job

I just removed some comments to make it smaller and easier to read.
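
As a side note on the script itself: the loop in start_coordinator that waits for the port file polls without sleeping and never gives up if dmtcp_coordinator fails to write the file. A minimal sketch of a bounded wait is below. The helper name wait_for_port_file and the 30-second default timeout are assumptions made for illustration; neither comes from the upstream example script, and this is not related to the errors shown above.

```shell
#!/bin/bash
# Hypothetical helper, not part of the upstream DMTCP job script.
# Waits until the file named by $1 exists and is non-empty, then prints
# its contents. Gives up after $2 seconds (default 30, chosen arbitrarily).
wait_for_port_file()
{
    local fname=$1
    local timeout=${2:-30}
    local waited=0

    # -s is true only when the file exists and has a size greater than zero
    while [ ! -s "$fname" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "Timed out after ${timeout}s waiting for $fname" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done

    cat "$fname"
}
```

Inside start_coordinator, this could stand in for the while loop, for example as p=$(wait_for_port_file "$fname") || exit 1, so the job fails with a message instead of spinning until the time limit.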

--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov



_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
