[julia-users] Re: eachline() work with pmap() is slow

Jeremy McNees Fri, 14 Oct 2016 09:52:16 -0700

I need to run something similar because I have a large number of text 
files. Each one is too large to load into memory all at once, let alone 
several at the same time. I find that pmap() works very well here.
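
One way to apply pmap() to many files is to give each worker a whole file 
and let it stream that file line by line. The sketch below is only an 
illustration under assumptions: it targets Julia 1.x (where pmap lives in 
the Distributed standard library), the helper name count_fields_per_file 
and the worker count of 2 are hypothetical, and the files are assumed to 
be tab-delimited and visible to every worker.

```julia
using Distributed

# Assumption: start a small local worker pool only if none exists yet.
nprocs() == 1 && addprocs(2)

# Hypothetical helper: one pmap task per file. Each worker streams its file
# with eachline, so no file is ever loaded into memory in full, and only a
# single integer per file travels back to the master process.
function count_fields_per_file(files, delim::Char = '\t')
    return pmap(files) do path
        n = 0
        for line in eachline(path)
            n += length(split(line, delim))
        end
        n
    end
end
```

Because every task covers a whole file, the cost of sending work to a 
worker is paid once per file rather than once per line.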


First, you should wrap your for loop in a function. In general you should 
organize your Julia code into functions. Second, can you provide a 
delimiter to the split function? 
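
A minimal sketch of both suggestions, assuming the file is tab-delimited 
(the default for writedlm). The helper name count_fields is hypothetical, 
and whether the explicit delimiter is faster on your data is worth timing 
rather than taking on faith.

```julia
# Hypothetical helper: the loop lives inside a function instead of running
# at global scope, and split receives an explicit delimiter.
function count_fields(path::AbstractString, delim::Char = '\t')
    total = 0
    for line in eachline(path)
        # Split on the known delimiter rather than on any whitespace.
        total += length(split(line, delim))
    end
    return total
end
```

It could then be timed with something like @time count_fields("tmp.txt").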

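Third, you may not need 32 worker processes for this job. Parallel 
processing carries overhead, and when each task is as cheap as splitting 
a single line, the cost of sending it to a worker can outweigh the work 
itself. 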
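
If the work has to stay per-line, one way to reduce that overhead is 
pmap's batch_size keyword, which sends lines to workers in groups. This is 
a sketch under assumptions, not a tuned solution: it targets Julia 1.x, 
the helper name fields_per_line is hypothetical, and the worker count of 2 
and default batch size of 1000 are arbitrary starting points.

```julia
using Distributed

# Assumption: a small local worker pool; more workers is not always faster.
nprocs() == 1 && addprocs(2)

# Hypothetical helper: count the fields on each line, with the lines sent to
# workers in batches so that one remote call covers many lines.
function fields_per_line(path, delim::Char = '\t', batch_size::Int = 1_000)
    return pmap(line -> length(split(line, delim)), eachline(path);
                batch_size = batch_size)
end
```

Note that pmap still collects one result per line on the master process, 
so the per-line function should return something small.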

This Stack Overflow post has some more information that might be useful: 
http://stackoverflow.com/questions/21890893/reading-csv-in-julia-is-slow-compared-to-python/35120894?noredirect=1#comment66827279_35120894


On Thursday, October 13, 2016 at 11:45:36 PM UTC-4, [email protected] wrote:
>
> I want to process each line of a large text file (100G) in parallel using 
> the following code
>
>     pmap(process_fun, eachline(the_file))
>
> However, pmap seems to be slow. Here is a dummy experiment:
>
> julia> writedlm("tmp.txt",rand(100000,100)) # produce a large file
> julia> @time for l in eachline("tmp.txt")
>               split(l)
>           end
>   5.678517 seconds (11.00 M allocations: 732.637 MB, 40.67% gc time)
>
> julia> addprocs() # 32 core
>
> julia> @time map(split, eachline("tmp.txt"));
>   4.834571 seconds (11.00 M allocations: 734.638 MB, 32.84% gc time)
>
> julia> @time pmap(split, eachline("tmp.txt"));
> 112.275411 seconds (227.06 M allocations: 10.024 GB, 50.72% gc time)
>
> The goal is to process those files (300+) as fast as possible. Are there 
> better ways to call pmap?
>
