On Friday, October 14, 2016 at 3:45:36 AM UTC, love...@gmail.com wrote:
> I want to process each line of a large text file (100G) in parallel using
> the following code
> pmap(process_fun, eachline(the_file))
> however, it seems that pmap is slow. The following is a dummy experiment:
> The goal is to process those files (300+) as fast as possible. Maybe there
> are better ways to call pmap?
I'm not sure there's much gain in processing *each* file in parallel, on top
of parallelizing across the many files (at least if they are of similar size,
and, say, one is not much bigger than the rest).
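If the 300+ files are the unit of work, a minimal sketch of that per-file
approach could look like the following (the directory name and the stand-in
process_fun are my assumptions; each worker streams its own file serially):

using Distributed
addprocs(4)                    # one worker per core; adjust to taste

@everywhere process_fun(line) = length(line)   # stand-in for the real work

dir = "data"                   # hypothetical location of the 300+ files
files = [joinpath(dir, f) for f in readdir(dir)]

# One file per pmap task; no per-line parallelism needed.
pmap(files) do path
    foreach(process_fun, eachline(path))
end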
By default, pmap distributes the computation over all specified workers.
I'm not sure how this works here: since lines in a file are not all the same
length, I THINK you need to read the file serially (there are probably
workarounds, but pmap wouldn't be responsible for that).
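One such workaround, sketched under my own assumptions (the batch size and the
stand-in process_fun are guesses): keep the read serial on the master, but
ship fixed-size batches of lines out to the workers:

using Distributed
addprocs(4)

@everywhere process_fun(line) = length(line)   # stand-in for the real work

# The master reads lines serially (they have variable length), while
# pmap distributes whole batches of lines across the workers.
function process_file(path; batchsize = 10_000)
    batches = Iterators.partition(eachline(path), batchsize)
    pmap(batch -> map(process_fun, batch), batches)
end

Batching matters because sending one line per remote call would drown the
actual work in messaging overhead.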
The computations would, however, be distributed. If they take a long time
(compared to the I/O, i.e. the read; else distributed=false might be a win?)
and are independent (I guess pmap requires that), then pmap could be a win,
but see above. Note also that the keyword parameters (batch_size=1 by
default) seem to me to be worth tuning for cheap per-line work.
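For instance, reusing the names from the question (the batch_size value is
only a guess to be tuned):

# Cheap per-line work: a larger batch_size amortizes the messaging overhead.
pmap(process_fun, eachline(the_file); batch_size = 1_000)

# If the work is dominated by the read rather than the CPU, running
# concurrent tasks in the master process may beat real distribution:
pmap(process_fun, eachline(the_file); distributed = false)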
That's some big file.. I'm kind of interested in big [1D] arrays (see the
other thread). It seems to me this is streaming work, and however big the
file gets, [each process] shouldn't need more than 2 GB (a limit I'm
interested in).