[julia-users] Re: eachline() work with pmap() is slow

Páll Haraldsson Sun, 16 Oct 2016 10:58:47 -0700


On Saturday, October 15, 2016 at 2:54:53 AM UTC, [email protected] wrote:
>
> I have changed the code to parallelize over files rather than lines. The 
> code is available here 
> <https://gist.github.com/innerlee/9b176a52b3330a1340ec94da0cdc721b> if 
> anyone is interested.
> However, the speed is still not satisfactory (total processing speed 
> approx. 10M/s; ideally it should be 100M/s, the network speed). 
> The CPU is not saturated, IO is not saturated, and I cannot find the 
> bottleneck...
>
> @Jeremy, thanks for the reply. The bottleneck is IO. You need days just to 
> stream all files at full speed, so waiting to load a whole file before 
> processing it wastes a lot of time. Ideally, the processing would be done 
> during a single streaming pass over the data, with no extra time.
> @Páll, do you mean that pmap will first do a ``collect`` operation,
>


yes and no..
 

> and then do the processing? So even if you give pmap an iterator, it will 
> not benefit from it? That would be sad.
>

I was thinking about what needs to happen. I'm still learning Julia, so let 
me check that I understand your question: are you asking whether Julia first 
has to gather everything into a collection (a DenseArray) before it starts 
processing?


I find it very instructive to prepend @edit to a call to see what Julia 
does, so I took a look at pmap (and I think this all means that it can start 
applying your function as you go):

    if batch_size == 1
[..]
        return collect(AsyncGenerator(f, c; ntasks=()->nworkers(p)))
    else
        batches = batchsplit(c, min_batch_count = length(p) * 3,
                                max_batch_size = batch_size)

        results = collect(flatten(AsyncGenerator(f, batches; ntasks=()->nworkers(p))))
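
The point that collect over a lazy wrapper need not materialize the input first can be checked with a small sketch. It uses a plain Base.Generator rather than AsyncGenerator, so it only illustrates the laziness, not the parallelism. The names read_line, process and events are made up for this illustration, and the generator stands in for something like eachline():

```julia
# Log of what happens, in order.
events = String[]

# Hypothetical stand-in for reading one line from a stream.
read_line(i) = (push!(events, "read $i"); "line $i")

# Hypothetical per-line work.
process(line) = (push!(events, "process " * line); uppercase(line))

# A lazy source: nothing is read until the generator is iterated.
source = (read_line(i) for i in 1:3)

# collect pulls one element at a time through the wrapper,
# so each line is processed right after it is read.
results = collect(Base.Generator(process, source))

println(events)
# The log alternates "read 1", "process line 1", "read 2", "process line 2", ...
```

If collect had to gather the whole input first, all three "read" entries would come before the first "process" entry.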


Yes, collect is used, but it is applied to an AsyncGenerator, which is 
itself an iterator. The doc for pmap says:

Transform collection c by applying f


help?> Base.AsyncGenerator
  AsyncGenerator(f, c...; ntasks=0) -> iterator

  Apply f to each element of c using at most ntasks asynchronous tasks. If
  ntasks is unspecified, uses max(100, nworkers()) tasks. [..]
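
The "at most ntasks asynchronous tasks" behaviour can be tried out with a minimal sketch. It uses the exported asyncmap function instead of the internal AsyncGenerator, and assumes a Julia version where asyncmap accepts the ntasks keyword (0.6 or later). The function slow_square and the counters running and peak are hypothetical names for this illustration:

```julia
# Counters for how many calls are in flight at once.
running = Ref(0)
peak = Ref(0)

# Hypothetical slow function; sleep yields to other tasks, like I/O would.
function slow_square(x)
    running[] += 1
    peak[] = max(peak[], running[])
    sleep(0.05)
    running[] -= 1
    return x^2
end

# Apply slow_square to each element with at most 3 concurrent tasks.
results = asyncmap(slow_square, 1:8; ntasks=3)

println(results)
println("peak concurrency: ", peak[])
```

The results come back in input order, and the peak counter never exceeds the ntasks limit, which is the same kind of bound pmap sets with ntasks=()->nworkers(p).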
