On Saturday, October 15, 2016 at 2:54:53 AM UTC, [email protected] wrote:
>
> I have changed the code to parallelize over files rather than lines. The
> code is available here
> <https://gist.github.com/innerlee/9b176a52b3330a1340ec94da0cdc721b> if
> anyone is interested.
> However, the speed is still not satisfactory (total processing speed
> approx. 10M/s; ideally it should be 100M/s, the network speed).
> The CPU is not saturated, IO is not saturated, and I cannot find the
> bottleneck...
>
> @Jeremy, thanks for the reply. The bottleneck is IO. You need days just to
> stream all files at full speed, so waiting for a whole file to load before
> processing it wastes a lot of time. Ideally, the processing would be done
> during a single streaming pass over the data, with no extra time.
> @Páll, do you mean that pmap will first do a ``collect`` operation,
>
yes and no..
> and only then do the processing? So even if you give pmap an iterator, it
> will not benefit from it? That would be sad.
>
I was thinking about what needs to happen. I'm still learning Julia, but if
I understand collect and what you have in mind, you are asking whether Julia
first has to build a collection (a DenseArray) of everything before it
starts processing.
I find it very instructive to prepend @edit to a call to see what Julia
does, so I took a look (and I think this all means that it can start
applying your function as you go):
    if batch_size == 1
        [..]
        return collect(AsyncGenerator(f, c; ntasks=()->nworkers(p)))
    else
        batches = batchsplit(c, min_batch_count = length(p) * 3,
                                max_batch_size = batch_size)
        results = collect(flatten(AsyncGenerator(f, batches;
                                                 ntasks=()->nworkers(p))))
Yes, collect is used, and the doc says:
    Transform collection c by applying f
but collect is applied to an AsyncGenerator here, and that is itself an
iterator:
    help?> Base.AsyncGenerator
      AsyncGenerator(f, c...; ntasks=0) -> iterator
      Apply f to each element of c using at most ntasks asynchronous tasks. If
      ntasks is unspecified, uses max(100, nworkers()) tasks. [..]
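
To try the "as you go" reading out, here is a minimal sketch, assuming a recent Julia where pmap lives in the Distributed standard library (on 0.5 it is in Base and the using line is not needed). The function process and the generator inputs are hypothetical stand-ins, not taken from the gist. It only shows that pmap can be handed a lazy generator instead of an array; it does not measure how far ahead of the workers the generator is consumed.

```julia
using Distributed  # assumption: Julia 1.x; on 0.5 pmap is available from Base

# Hypothetical stand-in for the per-file work.
process(x) = x^2

# A lazy generator: this code never builds an array of inputs itself.
inputs = (i for i in 1:10)

# With no extra workers added, pmap runs the calls on the current process.
# pmap returns the results in input order.
results = pmap(process, inputs)
```

If you add workers with addprocs, process would have to be defined on all of them (for example with @everywhere) before calling pmap.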