On this Julia version:
Version 0.5.0 (2016-09-19 18:14 UTC)
Official http://julialang.org/ release
x86_64-pc-linux-gnu
running on:
Ubuntu 14.04.3 LTS
I am trying to do a Monte Carlo simulation in parallel across 36 workers.
I have two problems (at least).
1. Some of the workers terminate at the beginning of the simulation, but I
don't understand the error message:
Worker 5 terminated.ERROR (unhandled task failure): ProcessExitedException()
in yieldto(::Task, ::ANY) at ./event.jl:136
in wait() at ./event.jl:169
in wait(::Condition) at ./event.jl:27
in wait(::Channel{Any}) at ./channels.jl:92
in take!(::Channel{Any}) at ./channels.jl:73
in #remotecall_fetch#606(::Array{Any,1}, ::Function, ::Function, ::Base.Worker, ::Function, ::Vararg{Any,N}) at ./multi.jl:1066
in remotecall_fetch(::Function, ::Base.Worker, ::Function, ::Vararg{Any,N}) at ./multi.jl:1062
in #remotecall_fetch#609(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Function, ::Vararg{Any,N}) at ./multi.jl:1080
in remotecall_fetch(::Function, ::Int64, ::Function, ::Vararg{Any,N}) at ./multi.jl:1080
in (::Base.##667#668{Base.#+,ProjectModule.##45#47{Int64,Array{Any,1},Array{Any,2}},UnitRange{Int64},Array{UnitRange{Int64},1}})() at ./multi.jl:1998
This is not a huge problem as the rest of the workers keep going and can
finish the simulation, but I would like to understand what is going on, if
possible. (And maybe how to fix it so as to use those workers.)
2. The more important problem is that at the end of the simulation, I run
into other errors and nothing is returned. My (uninformed and probably
wrong) guess is that there is something the program doesn't like about the
fact that the different workers are finishing at different times? The
errors I get are:
ERROR (unhandled task failure): EOFError: read end of file
Worker 16 terminated.ERROR (unhandled task failure): ProcessExitedException()
in yieldto(::Task, ::ANY) at ./event.jl:136
in wait() at ./event.jl:169
in wait(::Condition) at ./event.jl:27
in wait(::Channel{Any}) at ./channels.jl:92
in take!(::Channel{Any}) at ./channels.jl:73
in #remotecall_fetch#606(::Array{Any,1}, ::Function, ::Function, ::Base.Worker, ::Function, ::Vararg{Any,N}) at ./multi.jl:1066
in remotecall_fetch(::Function, ::Base.Worker, ::Function, ::Vararg{Any,N}) at ./multi.jl:1062
in #remotecall_fetch#609(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Function, ::Vararg{Any,N}) at ./multi.jl:1080
in remotecall_fetch(::Function, ::Int64, ::Function, ::Vararg{Any,N}) at ./multi.jl:1080
in (::Base.##667#668{Base.#+,ProjectModule.##45#47{Int64,Array{Any,1},Array{Any,2}},UnitRange{Int64},Array{UnitRange{Int64},1}})() at ./multi.jl:1998
And:
ERROR: LoadError: ProcessExitedException()
in wait(::Task) at ./task.jl:135
in collect_to!(::Array{Array{Float64,2},1}, ::Base.Generator{Array{Task,1},Base.#wait}, ::Int64, ::Int64) at ./array.jl:340
in collect(::Base.Generator{Array{Task,1},Base.#wait}) at ./array.jl:308
in preduce(::Function, ::Function, ::UnitRange{Int64}) at ./multi.jl:2002
in (::ProjectModule.##44#46{Int64,Array{Any,1},Array{Any,2},Int64})() at ./multi.jl:2011
in macro expansion at ./task.jl:326 [inlined]
in #OuterSim#43(::Int64, ::Int64, ::Int64, ::Array{Any,1}, ::Array{Any,2}, ::Function, ::Int64) at /home/ubuntu/dynhosp/DataStructs.jl:1321
in (::ProjectModule.#kw##OuterSim)(::Array{Any,1}, ::ProjectModule.#OuterSim, ::Int64) at ./<missing>:0
in include_from_node1(::String) at ./loading.jl:488
in process_options(::Base.JLOptions) at ./client.jl:262
in _start() at ./client.jl:318
while loading /home/ubuntu/dynhosp/Run.jl, in expression starting on line 9
And finally:
ERROR (unhandled task failure): On worker 9:
ArgumentError: Dict(kv): kv needs to be an iterator of tuples or pairs
in Type at ./dict.jl:388
in CalcWTP at /home/ubuntu/dynhosp/DataStructs.jl:728
in WTPMap at /home/ubuntu/dynhosp/DataStructs.jl:747
in #PSim#32 at /home/ubuntu/dynhosp/DataStructs.jl:1024
in #45 at ./multi.jl:2016
in #625 at ./multi.jl:1421
in run_work_thunk at ./multi.jl:1001
in macro expansion at ./multi.jl:1421 [inlined]
in #624 at ./event.jl:68
in #remotecall_fetch#606(::Array{Any,1}, ::Function, ::Function, ::Base.Worker, ::Function, ::Vararg{Any,N}) at ./multi.jl:1070
in remotecall_fetch(::Function, ::Base.Worker, ::Function, ::Vararg{Any,N}) at ./multi.jl:1062
in #remotecall_fetch#609(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Function, ::Vararg{Any,N}) at ./multi.jl:1080
in remotecall_fetch(::Function, ::Int64, ::Function, ::Vararg{Any,N}) at ./multi.jl:1080
in (::Base.##667#668{Base.#+,ProjectModule.##45#47{Int64,Array{Any,1},Array{Any,2}},UnitRange{Int64},Array{UnitRange{Int64},1}})() at ./multi.jl:1998
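For reference, the ArgumentError in that last trace is what Dict raises when it is handed a collection whose elements are not tuples or pairs. The following is only a minimal sketch of that behaviour: the inputs `good` and `bad` are hypothetical and are not taken from CalcWTP in DataStructs.jl, whose actual argument is not shown here.

```julia
# Hypothetical inputs for illustration only; not the data used by CalcWTP.
good = [(1, 0.5), (2, 0.25)]   # an iterator of (key, value) tuples
bad  = [0.5, 0.25]             # plain numbers, no keys

# Building a Dict from tuples works.
d = Dict(good)

# Building a Dict from plain numbers throws an ArgumentError;
# capture its message instead of letting it propagate.
msg = try
    Dict(bad)
    "no error"
catch err
    isa(err, ArgumentError) ? err.msg : rethrow(err)
end

println(msg)
```

If the value passed to Dict at DataStructs.jl:728 can ever be something other than a collection of tuples or pairs, that would be one possible source of this message.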
The actual function I am calling is:
function OuterSim(MCcount::Int; T1::Int64 = 3, dim1::Int64 = 290, dim2::Int64 = 67, fi = fips, da = data05)
    outp = @sync @parallel (+) for j = 1:MCcount
        Texas = MakeNew(fi, da)
        eq_patients = NewPatients()
        neq_patients = NewPatients()
        ResultsOut(NewSim(T1, Texas, eq_patients), PSim(T1, neq_patients); T = T1)
    end
    outp[:,1] = outp[:,1]/MCcount
    return outp
end
I added the "@sync" at the suggestion of a colleague here; I am not
sure it's necessary. (FWIW, I get the errors above on Ubuntu whether I
include it or not.)
This code *does* run and terminate without error on my own home machine
(running OS X, also v0.5), which has only four cores.
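To make the intent of the reduction concrete, here is a serial sketch of what the `(+)` loop in OuterSim is meant to compute: the last expression of each iteration is summed, and the first column is then divided by the iteration count. It also records which iterations throw instead of aborting, which is one way to check whether a failure depends on the parallel setup at all. Everything in it is an assumption made for illustration: `one_iteration` is a hypothetical stand-in for the `ResultsOut(NewSim(...), PSim(...))` call, the 2x2 result size is arbitrary, and the failure at j == 3 is simulated.

```julia
# Hypothetical stand-in for one Monte Carlo iteration; the real code calls
# ResultsOut(NewSim(...), PSim(...)). It fails on purpose when j == 3.
function one_iteration(j::Int)
    j == 3 && throw(ArgumentError("simulated failure in iteration 3"))
    return ones(2, 2)
end

# Serial equivalent of the (+) reduction that records failing iterations
# rather than letting the first exception end the whole run.
function serial_sim(mccount::Int)
    total = zeros(2, 2)
    failed = Int[]
    for j in 1:mccount
        try
            total += one_iteration(j)
        catch err
            push!(failed, j)
            println("iteration ", j, " failed: ", err)
        end
    end
    total[:, 1] = total[:, 1] / mccount   # same averaging step as in OuterSim
    return total, failed
end

total, failed = serial_sim(5)
```

Here `failed` lists the iterations that raised an exception, and `total` holds the sum over the iterations that succeeded.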
I would love your feedback!
Thanks -
AB