On Saturday, October 15, 2016 at 2:54:53 AM UTC, love...@gmail.com wrote:
> I have changed the code to parallelize over files rather than lines. The code is 
> available here 
> <https://gist.github.com/innerlee/9b176a52b3330a1340ec94da0cdc721b> if 
> anyone is interested.
> However, the speed is still not satisfactory (total processing speed 
> approx. 10M/s; ideally it should be 100M/s, the network speed). 
> Neither CPU nor IO is fully used, and I cannot find the bottleneck...
> @Jeremy, thanks for the reply. The bottleneck is IO. You need days just to 
> stream all files at full speed, so waiting to load a whole file before 
> processing it wastes a lot of time. Ideally, the processing would finish 
> during a single streaming pass over the data, with no extra time.
> @Páll, do you mean that pmap will first do a ``collect`` operation,

yes and no..

> then processing? So even if you give pmap an iterator, it will not benefit 
> from it? That would be sad.

I was thinking about what needs to happen. I'm still learning Julia, so let 
me check that I understand what you mean by collect: are you asking whether 
Julia first has to gather everything into a collection (a DenseArray) 
before it starts processing?

I find it a very cool way to learn: prepend @edit to a call and see what 
Julia does. So I took a look at pmap (and I think this all means that it can 
start applying your function as you go):

    if batch_size == 1
        return collect(AsyncGenerator(f, c; ntasks=()->nworkers(p)))
    [..]
        batches = batchsplit(c, min_batch_count = length(p) * 3,
                                max_batch_size = batch_size)

        results = collect(flatten(AsyncGenerator(f, batches; [..]

Yes, collect is used, but as far as I can tell it collects the results of 
the AsyncGenerator, not the input c up front. The pmap doc only says:

Transform collection c by applying f

help?> Base.AsyncGenerator
  AsyncGenerator(f, c...; ntasks=0) -> iterator

  Apply f to each element of c using at most ntasks asynchronous tasks. If
  ntasks is unspecified, uses max(100, nworkers()) tasks. [..]
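
Here is a minimal sketch of the streaming case, assuming a recent Julia (1.3 
or later). It uses asyncmap, which the pmap docs describe as equivalent to 
pmap with distributed=false, so it only illustrates the task-based path and 
not real worker processes. stream_items and process are hypothetical 
stand-ins for reading files and for the per-file work; the Channel plays the 
role of an input that arrives piece by piece.

```julia
# Hypothetical stand-in for streaming files: items are produced one at a
# time and are only available as the producer task hands them over.
function stream_items(n)
    Channel{Int}(0) do ch        # unbuffered: each put! waits for a taker
        for i in 1:n
            put!(ch, i)
        end
    end
end

# Hypothetical per-item work; sleep stands in for processing time.
process(x) = (sleep(0.01); x^2)

# asyncmap iterates the channel and hands each item to one of at most
# ntasks asynchronous tasks, then returns the results as an array.
results = asyncmap(process, stream_items(8); ntasks=4)

println(results)
```

The results come back in input order, and ntasks bounds how many items are 
in flight at once. Whether this overlaps IO and processing well enough to 
reach the network speed would still need measuring.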
