I have changed the code to parallelize over files rather than lines. The 
code is available here 
<https://gist.github.com/innerlee/9b176a52b3330a1340ec94da0cdc721b> if 
anyone is interested.
However, the speed is still not satisfactory: total processing throughput 
is approx. 10M/s, whereas ideally it should reach 100M/s, the network speed. 
Neither the CPU nor the IO is saturated, and I cannot find the bottleneck...

@Jeremy, thanks for the reply. The bottleneck is IO: it takes days just to 
stream all the files at full speed, so waiting for a whole file to load 
before processing it wastes a lot of time. Ideally, processing would keep 
up with the stream, so that one pass over the data finishes both the 
reading and the processing with no extra time.
@Páll, do you mean that pmap first does a ``collect`` on its input and only 
then starts processing? So even if you give pmap an iterator, it will not 
benefit from it? That would be sad. 
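
To make the "process while streaming" idea concrete, here is a minimal 
sketch of what I have in mind. It is not the code from the gist. It assumes 
a recent Julia (1.x), and ``process_line``, ``process_stream`` and 
``process_overlapped`` are hypothetical names made up for illustration. The 
first function handles each line as soon as it is read; the second puts a 
bounded Channel between a reader task and the processing loop, so reading 
can continue while earlier lines are still being processed.

```julia
# Hypothetical stand-in for the real per-line work (assumption, not from the gist).
process_line(line::AbstractString) = length(line)

# Process each line as soon as it is read; the whole file is never held in memory.
function process_stream(io::IO)
    total = 0
    for line in eachline(io)      # lazy: reads one line at a time
        total += process_line(line)
    end
    return total
end

# Same result, but a reader task fills a bounded Channel while the main loop
# drains it, so waiting on IO can overlap with processing.
function process_overlapped(io::IO; buffer::Int = 1024)
    lines = Channel{String}(buffer)
    reader = @async begin
        try
            for line in eachline(io)
                put!(lines, line)   # blocks when the buffer is full
            end
        finally
            close(lines)            # lets the consumer loop terminate
        end
    end
    total = 0
    for line in lines               # ends once the channel is closed and empty
        total += process_line(line)
    end
    wait(reader)                    # surfaces any error raised by the reader
    return total
end
```

Usage would be something like ``open(process_overlapped, "some_file.txt")``, 
where the file name is again just a placeholder. Note that tasks only 
overlap waiting on IO with computation; they do not by themselves use more 
cores, so this helps only if the per-line work is cheaper than the read.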
