On Friday, October 14, 2016 at 3:45:36 AM UTC, love...@gmail.com wrote:
>
> I want to process each line of a large text file (100 GB) in parallel 
> using the following code
>
>     pmap(process_fun, eachline(the_file))
>
> However, it seems that pmap is slow. Following is a dummy experiment:
>
>  

> The goal is to process those files (300+) as fast as possible. And maybe 
> there are better ways to call pmap?
>

I'm not sure there's much gain in processing *each* file in parallel on top 
of parallelizing across these many files (at least if they are of similar 
size, and no single one is much bigger than the rest).
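
That is, a minimal sketch of distributing whole files to workers instead of 
individual lines. The worker count, the "data" directory, and process_fun 
being defined on every worker are my assumptions, not from your post:

    addprocs(4)                          # hypothetical worker count

    @everywhere function process_file(path)
        results = []
        open(path) do io
            # each worker reads its own file serially; only the work is parallel
            for line in eachline(io)
                push!(results, process_fun(line))  # assumes process_fun is @everywhere
            end
        end
        results
    end

    files = [joinpath("data", f) for f in readdir("data")]  # hypothetical paths
    pmap(process_file, files)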

help?> pmap
[..]
  By default, pmap distributes the computation over all specified workers.
[..]

I'm not sure how this works: since lines in a file are not all the same 
length, I THINK you need to read the file serially (there are probably 
workarounds, but pmap wouldn't be responsible for that).
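
One such workaround, sketched (the chunk size and the helper name are made 
up): let the master do the serial read in chunks and farm out only the 
processing:

    function process_in_chunks(path; chunk = 10_000)
        results = []
        open(path) do io
            buf = String[]
            for line in eachline(io)
                push!(buf, line)
                if length(buf) == chunk
                    append!(results, pmap(process_fun, buf))  # distribute one chunk
                    empty!(buf)
                end
            end
            # don't forget the final, partial chunk
            isempty(buf) || append!(results, pmap(process_fun, buf))
        end
        results
    end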

The computations would, however, be distributed. If they take a long time 
compared to the I/O (the read; otherwise distributed=false might be a win?) 
and are independent (I guess pmap requires that), then pmap could be a win, 
but see above. Note also that parameters such as batch_size (default 1) 
seem to me to be tuning knobs?
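
For example, with the two keywords from the docstring (the values here are 
illustrative only, not recommendations):

    # ship lines to the workers in batches of 100 instead of one at a time
    pmap(process_fun, eachline(the_file); batch_size = 100)

    # or stay in-process with async tasks, which can win when the per-line
    # work is cheap relative to the read
    pmap(process_fun, eachline(the_file); distributed = false)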


That's some big file.. I'm kind of interested in big [1D] arrays (see the 
other thread). It seems to me this is streaming work, and however big the 
file gets, [each process] shouldn't need more than 2 GB (a limit I'm 
interested in).

