Hello, world!

I'm trying to set up DMTCP to work with Slurm in my environment for my users.

I'm using DMTCP dmtcp-2.6.1~rc1-0.1.el7.x86_64, which I installed from the EPEL repo, along with Slurm 20.11.3 and OpenMPI 4.0.3. My program is a simple MPI-based "Hello, world" style program that prints a message from each MPI rank every 2 seconds.

I created an sbatch file based on the example found at https://github.com/dmtcp/dmtcp/blob/master/plugin/batch-queue/job_examples/slurm_launch.job. Basically, I just removed the comments to make it more readable and brief enough to post here:

#!/bin/bash

#SBATCH -p general
#SBATCH -t 00:01:00
#SBATCH -n 2
#SBATCH --mem=1000
#SBATCH --export=ALL
#SBATCH -J dmctp_hello
#SBATCH -o %x-%j.out
#SBATCH -e %x-%j.err
#SBATCH --mail-type=ALL

module load gcc/9.3.0
module load openmpi/4.0.3

# 0. Set up DMTCP environment for a job

start_coordinator()
{
    ############################################################
    # For debugging when launching a custom coordinator, uncomment
    # the following lines and provide the proper host and port for
    # the coordinator.
    ############################################################
    # export DMTCP_COORD_HOST=$h
    # export DMTCP_COORD_PORT=$p
    # return

    fname=dmtcp_command.$SLURM_JOBID
    h=$(hostname)

    check_coordinator=$(which dmtcp_coordinator)
    if [ -z "$check_coordinator" ]; then
        echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
        exit 0
    fi

    dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1

    while true; do
        if [ -f "$fname" ]; then
            p=$(cat $fname)
            if [ -n "$p" ]; then
                # try to communicate ? dmtcp_command -p $p l
                break
            fi
        fi
    done

    # Create dmtcp_command wrapper for easy communication with coordinator
    p=$(cat $fname)
    chmod +x $fname
    echo "#!/bin/bash" > $fname
    echo >> $fname
    echo "export PATH=$PATH" >> $fname
    echo "export DMTCP_COORD_HOST=$h" >> $fname
    echo "export DMTCP_COORD_PORT=$p" >> $fname
    echo "dmtcp_command \$@" >> $fname

    # Set up local environment for DMTCP
    export DMTCP_COORD_HOST=$h
    export DMTCP_COORD_PORT=$p
    export DMTCP_SIGCKPT=10

}

# 1. Start DMTCP coordinator
start_coordinator -i 10


# 2. Launch application
srun /usr/bin/dmtcp_launch --rm ./dmctp_hello
# dmtcp_launch --rm mpirun /u/pbisbal/testing/dmtcp/dmtcp_hello
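
One side note on the script above: the loop in start_coordinator that waits for the port file spins with no sleep and no timeout, so if the coordinator never writes the file, the job just burns CPU until its time limit. Below is a minimal sketch of a bounded wait. The function name wait_for_port_file and the 30-second default are assumptions made for illustration; they are not part of the DMTCP example script.

```shell
#!/bin/bash
# Hypothetical helper (not from the DMTCP example): wait until the
# coordinator's port file exists and is non-empty, then print its contents.
# Gives up after a timeout instead of spinning forever.
wait_for_port_file()
{
    local fname=$1
    local timeout=${2:-30}   # seconds; assumed default
    local waited=0

    while [ "$waited" -lt "$timeout" ]; do
        # -s is true only if the file exists and has a size greater than zero
        if [ -s "$fname" ]; then
            cat "$fname"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done

    echo "Timed out after ${timeout}s waiting for $fname" >&2
    return 1
}
```

Inside start_coordinator, the while loop could then presumably be replaced by something like:

    p=$(wait_for_port_file "$fname" 30) || exit 1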

When I submit the job using srun as shown above, the only output from my job is this message, written to standard error:

srun: error: ellis002: tasks 0-1: Exited with exit code 99

where ellis002 is the node that was allocated to the job.

When I replace the srun command with this one:

dmtcp_launch --rm mpirun /u/pbisbal/testing/dmtcp/dmtcp_hello

I get a lot more errors:

[40000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[40000] WARNING at signalwrappers.cpp:141 in sigaction; REASON='JWARNING(false) failed'
     "Application trying to use DMTCP's signal for it's own use.\n" "  You should employ a different signal by setting the\n" "  environment variable DMTCP_SIGCKPT to the number\n" " of the signal that DMTCP should use for checkpointing." = Application trying to use DMTCP's signal for it's own use.
  You should employ a different signal by setting the
  environment variable DMTCP_SIGCKPT to the number
  of the signal that DMTCP should use for checkpointing.
     stopSignal = 10
[40000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[41000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[41000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[41000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[42000] WARNING at socketconnection.cpp:222 in TcpConnection; REASON='JWARNING(false) failed'
     type = 2
Message: Datagram Sockets not supported. Hopefully, this is a short lived connection!
[41000] ERROR at sysvipcwrappers.cpp:169 in shmctl; REASON='JASSERT(realShmid != -1) failed'
dmtcp_hello (41000): Terminating...
[42000] ERROR at sysvipcwrappers.cpp:169 in shmctl; REASON='JASSERT(realShmid != -1) failed'
dmtcp_hello (42000): Terminating...
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[57709,1],1]
  Exit code:    99
--------------------------------------------------------------------------
[40000] WARNING at signalwrappers.cpp:141 in sigaction; REASON='JWARNING(false) failed'
     "Application trying to use DMTCP's signal for it's own use.\n" "  You should employ a different signal by setting the\n" "  environment variable DMTCP_SIGCKPT to the number\n" " of the signal that DMTCP should use for checkpointing." = Application trying to use DMTCP's signal for it's own use.
  You should employ a different signal by setting the
  environment variable DMTCP_SIGCKPT to the number
  of the signal that DMTCP should use for checkpointing.
     stopSignal = 10

I did a search for "Application trying to use DMTCP's signal for it's own use", and I found this link:

http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_(DMTCP)/

I added

export DMTCP_SIGCKPT=10

to my sbatch script as they advised (and as shown above), but it didn't help.

Any suggestions?


--
Prentice

