On Friday, October 14, 2016 at 3:45:36 AM UTC, love...@gmail.com wrote:
> I want to process each line of a large text file (100G) in parallel using
> the following code
> pmap(process_fun, eachline(the_file))
> however, it seems that pmap is slow. The following is a dummy experiment:
> The goal is to process those files (300+) as fast as possible. Maybe there
> are better ways to call pmap?
I'm not sure there's much gain in processing *each* file in parallel, on top
of parallelizing across the many files (at least if they are of similar size,
and, say, one is not much bigger than the rest).
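If the 300+ files are the unit of work, a minimal sketch of that per-file
approach could look like the following (the directory name and the stand-in
process_fun are my assumptions; each worker streams its own file serially):

using Distributed
addprocs(4)                    # one worker per core; adjust to taste

@everywhere process_fun(line) = length(line)   # stand-in for the real work

dir = "data"                   # hypothetical location of the 300+ files
files = [joinpath(dir, f) for f in readdir(dir)]

# One file per pmap task; no per-line parallelism needed.
pmap(files) do path
    foreach(process_fun, eachline(path))
end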
By default, pmap distributes the computation over all specified workers.
I'm not sure how this works here: since lines in a file are not all the same
length, I THINK you need to read the file serially (there are probably
workarounds, but pmap wouldn't be responsible for that).
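One such workaround, sketched under my own assumptions (the batch size and the
stand-in process_fun are guesses): keep the read serial on the master, but
ship fixed-size batches of lines out to the workers:

using Distributed
addprocs(4)

@everywhere process_fun(line) = length(line)   # stand-in for the real work

# The master reads lines serially (they have variable length), while
# pmap distributes whole batches of lines across the workers.
function process_file(path; batchsize = 10_000)
    batches = Iterators.partition(eachline(path), batchsize)
    pmap(batch -> map(process_fun, batch), batches)
end

Batching matters because sending one line per remote call would drown the
actual work in messaging overhead.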
The computations would, however, be distributed. If they take a long time
(compared to the I/O, i.e. the read; else distributed=false might be a win?)
and are independent (I guess pmap requires that), then pmap could be a win,
but see above. Note also that the keyword parameters (batch_size=1 by
default) seem to me to be worth tuning for cheap per-line work.
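For instance, reusing the names from the question (the batch_size value is
only a guess to be tuned):

# Cheap per-line work: a larger batch_size amortizes the messaging overhead.
pmap(process_fun, eachline(the_file); batch_size = 1_000)

# If the work is dominated by the read rather than the CPU, running
# concurrent tasks in the master process may beat real distribution:
pmap(process_fun, eachline(the_file); distributed = false)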
That's some big file.. I'm kind of interested in big [1D] arrays (see the
other thread). It seems to me this is streaming work, and however big the
file gets, [each process] shouldn't need more than 2 GB (a limit I'm
interested in).