Hi,
I'm trying to launch a parallel Julia session with 4 worker processes: 2
on host01 and 2 on host02 (launching from host01). It's not working, and
the way it fails is kind of weird.
I can start two workers, one on host01 and one on host02, with the
following machinefile.txt:
host02
host01
and typing:
$ julia --machinefile machinefile.txt
Then, for example:
julia> @everywhere run(`hostname`)
host01
From worker 2: host01
From worker 3: host02
I can also add 2 more workers on host01, where machinefile.txt is now:
host01
host02
host01
host01
and, after launching julia the same way, typing:
julia> @everywhere run(`hostname`)
host01
From worker 3: host02
From worker 2: host01
From worker 5: host01
From worker 4: host01
However, if I try to start 2 workers on each machine with the following
machinefile.txt:
host01
host01
host02
host02
then it hangs, and after about a minute I get this error:
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
Master process (id 1) could not connect within 60.0 seconds.
exiting.
ERROR: connect: connection timed out (ETIMEDOUT)
in wait at ./task.jl:284
in wait at ./task.jl:194
in stream_wait at stream.jl:263
in wait_connected at stream.jl:301
in Worker at multi.jl:113
in create_worker at multi.jl:1064
in start_cluster_workers at multi.jl:1028
in addprocs_internal at multi.jl:1234
in addprocs at multi.jl:1244
in process_options at ./client.jl:240
in _start at ./client.jl:354
in _start_3B_1716 at /usr/bin/../lib/x86_64-linux-gnu/julia/sys.so
Is this a problem with my network or system configuration?
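In case it helps narrow this down, here is a minimal sketch of two checks that could be run from host01. It assumes the workers are started over ssh and that machinefile.txt is in the current directory. The helper names machinefile_summary and check_hosts are made up for illustration and are not part of Julia.

```shell
# Hypothetical helper: print each distinct host in a machinefile
# together with the number of workers requested on it.
machinefile_summary() {
    # Drop blank lines, then count repeated host names.
    grep -v '^[[:space:]]*$' "$1" | sort | uniq -c | awk '{print $2, $1}'
}

# Hypothetical helper: try a non-interactive ssh login to every host.
# Assumption: workers are launched over ssh, so a password prompt or an
# unreachable host would show up here. BatchMode and ConnectTimeout are
# standard OpenSSH client options.
check_hosts() {
    for h in $(grep -v '^[[:space:]]*$' "$1" | sort -u); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" hostname >/dev/null 2>&1; then
            echo "$h: ssh ok"
        else
            echo "$h: ssh FAILED"
        fi
    done
}

# Recreate the failing layout from this post (2 workers per host).
printf 'host01\nhost01\nhost02\nhost02\n' > machinefile.txt
machinefile_summary machinefile.txt

# To test the logins, run:
#   check_hosts machinefile.txt
```

For the failing layout, machinefile_summary should print "host01 2" and "host02 2". check_hosts only tests ssh logins, not the TCP connections the master later opens to each worker, so a firewall between the hosts could still be the cause even if both hosts report ok.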
Thanks.