Hello dear Julia fellows!

I use a for loop to batch-run a clustering script with different data and 
parameters. You can find my code in this gist: 
<https://gist.github.com/axsk/91fcf1e313f4f330889a>.

Most of the work (~120s per call) is spent in the Hokusai.cluster call, 
which is a pure function that uses pmap (julia -p 4) to compute clusterings 
for a further set of parameters (sigma, tau).
Furthermore I use ProgressMeter.jl to track the progress p, and 
JLD/DataFrames to save my results after each successful computation.

The loop is run 578 times, each time appending ~60 new rows to the 
DataFrame hdf.
So all in all this task takes about 20 hours.

Unfortunately, after some hours (if I recall correctly it was after 6, 9, 
and 10 hours in the runs I looked at), the loop stops making progress and 
hangs, as can be seen from the missing progress messages as well as in htop.
Normally all four workers are at full load, but when it hangs only one 
worker is at full load (for multiple hours) while the others are idle.

If I restart Julia and continue the loop by loading the saved 
hdf::DataFrame, it skips all entries computed so far (line 7) and resumes 
after the last saved entry (I checked that). It then runs without problems 
until it gets stuck again at some random later point.
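
To make the structure easier to follow without opening the gist, here is a 
simplified sketch of the skip-and-resume pattern described above. It is not 
the gist code: fake_cluster is a hypothetical stand-in for Hokusai.cluster, 
a plain vector stands in for the DataFrame hdf, the names run_one and 
run_batch! are made up for illustration, and saving to disk is only 
indicated by a comment.

```julia
using Distributed  # stdlib; provides pmap (runs on the master process if no workers were added)

# Hypothetical stand-in for Hokusai.cluster: a pure function of its inputs.
# With real workers (julia -p 4) this definition would need @everywhere.
fake_cluster(data, sigma, tau) =
    (data = data, sigma = sigma, tau = tau, score = data * sigma + tau)

# Compute one result per (sigma, tau) combination for a single dataset via pmap.
function run_one(data, sigmas, taus)
    params = [(s, t) for s in sigmas for t in taus]
    return pmap(p -> fake_cluster(data, p[1], p[2]), params)
end

# Batch loop: skip datasets that already have rows in `results`, append the rest.
function run_batch!(results, datasets, sigmas, taus)
    done = Set(r.data for r in results)
    for d in datasets
        d in done && continue   # resume: skip entries computed in an earlier run
        append!(results, run_one(d, sigmas, taus))
        # the real script saves the table to disk at this point (JLD)
    end
    return results
end
```

Calling run_batch! a second time with the same results vector only 
computes the datasets that are not in it yet, which is the behaviour I see 
when I restart and reload the saved table.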

I really do not know how to debug or solve this.
As it happens on seemingly random iterations, it takes a long time to 
reproduce.

Do you have any ideas what might cause this or how to find the bug?

Best,
Alex
