Hello World,
I am relatively new to Julia. I wrote an optimization model (a MIP) that I
need to run many, many times for a sensitivity analysis. I am working on a
cluster that uses SLURM, and I wrote my model as a Julia module.
Basically, what I want to do is have a file listing all the different
cases and, using a for-loop, have each case solved by a different node
(with all cores in that node working on the same MIP case).
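In pseudocode, the pattern I am after is roughly this (just the idea, not
runnable code):

```julia
# Pseudocode of the intended workflow:
# cases = read the list of cases from a CSV file
# for c in cases
#     send case c to one of the allocated nodes
#     that node solves the MIP for case c using all of its cores
#     write the results for case c to its own output directory
# end
```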
I have tried two different ways: (1) using the --machinefile option of
Julia (see the .sh file below).
#!/bin/bash
#SBATCH --uid=nsep
#SBATCH --job-name="juliaTest"
#SBATCH --partition=newnodes
#SBATCH --output="juliaTest.%j.%N.out"
#SBATCH --error="juliaTest.%j.%N.err"
#SBATCH --time=1:0:0
#SBATCH -N 3
##SBATCH -n 20
#SBATCH --export=ALL
export SLURM_NODEFILE=`generate_pbs_nodefile`
. /etc/profile.d/modules.sh
module add engaging/julia/0.4.3
module add engaging/gurobi/6.5.1
julia --machinefile $SLURM_NODEFILE ~/Cases.jl
With this method I get an error when loading MyModule (my model module) on the workers. My Cases.jl looks like this:
@everywhere push!(LOAD_PATH, "/home/nsep/Test")
@everywhere using MyModule
@everywhere using DataFrames
@everywhere inpath = "/home/nsep/Test/Input"
@everywhere outpath = "/home/nsep/Test/Results"
@everywhere mysetup = Dict() # config. options for MyModule
@everywhere casepath = "/home/nsep/Test"
@everywhere cases_in_data = readtable("$casepath/Cases_Control.csv", header=true)
@parallel for c in 1:size(cases_in_data,1)
    # loading general inputs
    myinputs = Load_inputs(mysetup, inpath)
    # creating an output directory for this case
    mkdir("$outpath/Case$c")
    case_outpath = "$outpath/Case$c"
    # case-specific inputs
    myinputs["pMaxCO2"][1] = cases_in_data[:Emissions][c]
    myresults = solve_model(mysetup, myinputs)
    write_outputs(mysetup, case_outpath, myresults, myinputs)
end
The error that I get is:
WARNING: replacing module MyModule
WARNING: replacing module MyModule
WARNING: replacing module MyModule
signal (11): Segmentation fault
jl_module_using at
/cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so
(unknown line)
unknown function (ip: 0x2aaaaae0def9)
unknown function (ip: 0x2aaaaae0e1e5)
unknown function (ip: 0x2aaaaae0de3d)
unknown function (ip: 0x2aaaaae0e77c)
jl_load_file_string at
/cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so
(unknown line)
include_string at loading.jl:266
jl_apply_generic at
/cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so
(unknown line)
include_from_node1 at ./loading.jl:307
jl_apply_generic at
/cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so
(unknown line)
unknown function (ip: 0x2aaaaadf92a3)
unknown function (ip: 0x2aaaaadf8639)
unknown function (ip: 0x2aaaaae0daac)
jl_toplevel_eval_in at
/cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so
(unknown line)
eval at ./sysimg.jl:14
jl_apply_generic at
/cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so
(unknown line)
anonymous at multi.jl:1364
jl_f_apply at
/cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so
(unknown line)
anonymous at multi.jl:910
run_work_thunk at multi.jl:651
run_work_thunk at multi.jl:660
jlcall_run_work_thunk_21367 at (unknown line)
jl_apply_generic at
/cm/shared/engaging/julia/julia-a2f713dea5/bin/../lib/julia/libjulia.so
(unknown line)
anonymous at task.jl:58
unknown function (ip: 0x2aaaaadff514)
unknown function (ip: (nil))
sh: line 1: 18358 Segmentation fault
/cm/shared/engaging/julia/julia-a2f713dea5/bin/julia --worker
Worker 2 terminated.
ERROR (unhandled task failure): EOFError: read end of file
in read at stream.jl:911
in message_handler_loop at multi.jl:868
in process_tcp_streams at multi.jl:857
in anonymous at task.jl:63
ERROR: LoadError: ProcessExitedException()
in yieldto at ./task.jl:71
in wait at ./task.jl:371
in wait at ./task.jl:286
in wait at ./channels.jl:63
in take! at ./channels.jl:53
in take! at ./multi.jl:809
in remotecall_fetch at multi.jl:735
in remotecall_fetch at multi.jl:740
in anonymous at multi.jl:1386
...and 1 other exceptions.
in sync_end at ./task.jl:413
in anonymous at multi.jl:1395
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:280
in _start at ./client.jl:378
while loading /home/nsep/Cases.jl, in expression starting on line 3
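In case it helps narrow things down: since the error points at an
expression near the top of Cases.jl, and given the "replacing module
MyModule" warnings, I suspect (not verified beyond the full run above)
that the crash happens at the `@everywhere using` line. A stripped-down
script like this should hit the same spot:

```julia
# minimal_cases.jl -- stripped-down version of Cases.jl that keeps only
# the part I suspect is crashing (loading MyModule on all workers).
# Launched the same way: julia --machinefile $SLURM_NODEFILE minimal_cases.jl
@everywhere push!(LOAD_PATH, "/home/nsep/Test")
@everywhere using MyModule   # the warnings and the segfault appear here
```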
(2) The other method I tried was using ClusterManagers.jl (.sh file below).
#!/bin/bash
#SBATCH --uid=nsep
#SBATCH --job-name="juliaTest"
#SBATCH --partition=newnodes
#SBATCH --output="juliaTest.%j.%N.out"
#SBATCH --error="juliaTest.%j.%N.err"
#SBATCH --time=0:2:0
#SBATCH -N 4
#SBATCH --export=ALL
. /etc/profile.d/modules.sh
module add engaging/julia/0.4.3
module add engaging/gurobi/6.5.1
julia ~/julia_cluster.jl
and then, in the Julia code, I tried to run the SLURM example from the
ClusterManagers README:
using ClusterManagers
# Arguments to the Slurm srun(1) command can be given as keyword
# arguments to addprocs. The argument name and value is translated to
# a srun(1) command line argument as follows:
# 1) If the length of the argument is 1 => "-arg value",
# e.g. t="0:1:0" => "-t 0:1:0"
# 2) If the length of the argument is > 1 => "--arg=value"
# e.g. time="0:1:0" => "--time=0:1:0"
# 3) If the value is the empty string, it becomes a flag value,
# e.g. exclusive="" => "--exclusive"
# 4) If the argument contains "_", they are replaced with "-",
# e.g. mem_per_cpu=100 => "--mem-per-cpu=100"
addprocs(SlurmManager(4), partition="newnodes", t="00:2:00")
hosts = []
pids = []
for i in workers()
    host, pid = fetch(@spawnat i (gethostname(), getpid()))
    push!(hosts, host)
    push!(pids, pid)
end
# The Slurm resource allocation is released when all the workers have exited
for i in workers()
    rmprocs(i)
end
But I get this error:
Error launching Slurm job:
MethodError(length,(:all_to_all,))
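My (unverified) guess about this one: some keyword whose value is the
Symbol :all_to_all (the default topology argument of addprocs, maybe?)
seems to reach the keyword-to-srun translation, which calls length on
the value. length has no method for a Symbol, which would produce
exactly this MethodError:

```julia
# length() has no method for Symbol, so this throws the same
# MethodError(length, (:all_to_all,)) that I see from addprocs:
length(:all_to_all)
```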
If anyone could help me figure out (1) what is wrong in my code when
loading MyModule on the workers, and (2) what I am doing wrong with
ClusterManagers, that would be AWESOME!