OK, I've done a little more digging.
It seems that in v0.4, remote workers are started differently. This is my
understanding:
Only one worker for each host is started directly from the master process.
Additional workers on each host are started from the first worker on that
host.
Thus output from these additional workers is routed via the first worker on
that host (rather than directly to the master process).
Somehow this routing causes the intermingled output.
To overcome this, I can start all workers directly from the master process,
and output is orderly again (as for v0.3).
Presumably, the new v0.4 indirect method was to speed up adding remote
workers.
Clearly, I don't really understand much of this. And I'm not sure how
connecting all workers directly to the master process affects performance or
scalability.
Intuitively, it doesn't sound good, but for my purpose it does give more
readable output.
To help speed up the startup of workers, I can start workers on different
hosts in parallel (but the workers on each host are started serially and
directly from the master process):
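# machines: collection of (host, nworkers) pairs
@sync begin
    for (host, nworkers) in machines
        # one task per host, so different hosts start in parallel
        @async begin
            for i = 1:nworkers
                # one worker at a time, launched directly from the master
                addprocs([(host, 1)])
            end
        end
    end
end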