The DataStream functionality was moved out of the main DataFrames codebase because it isn't sufficiently robust yet. We need to update the docs to reflect that.
Have you tried append! on DataFrame objects? It should resolve some of the
problems you're facing with vcat/rbind, which copy more and more data at each
step. Also, if you know the length of the final output, you can preallocate
memory and get a lot of speedups. There are still a bunch of places where
you'll end up doing slow memory allocation, but this will get you a good
chunk of the way there. (Most of the memory costs will be from I/O.) A
minimal sketch of both approaches follows the quoted message below.

 -- John

On Jul 28, 2014, at 3:33 PM, Timothée Poisot <[email protected]> wrote:

> Hi,
>
> I'm analyzing a dataset that is relatively large -- basically, I'm reading
> JSON files, extracting whatever information I want into a DataFrame
> (usually around 500 rows), and repeating the process 25000 times.
>
> At the moment, my strategy is to loop through the JSON files, read them,
> create the relevant DataFrame, and rbind it to the global DataFrame.
> Obviously this results in each new dataset taking longer and longer to be
> added. It's instantaneous at the beginning, but each rbind operation takes
> up to a few seconds by the end.
>
> I am about to try streaming data analysis
> (http://juliastats.github.io/DataFrames.jl/datastreams.html) -- basically
> writing each small DataFrame to disk, and having a function that returns
> the rows one after the other. But I'm really curious about how people have
> managed similar problems before -- the "final" dataset is likely to be 10x
> larger, so I'm going to need every improvement I can get.
>
> Thanks!
> t
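A minimal sketch of both suggestions, assuming each JSON file parses to an
array of objects. The extract_frame helper and the column names ("site",
"value") are illustrative stand-ins, not from the original post, and the
syntax follows current DataFrames.jl rather than the 2014 API:

    using DataFrames
    using JSON

    # Hypothetical extraction step: parse one JSON file into a small
    # DataFrame. The field names here are illustrative assumptions.
    function extract_frame(path)
        records = JSON.parsefile(path)   # assumes an array of objects per file
        return DataFrame(site  = String[r["site"] for r in records],
                         value = Float64[r["value"] for r in records])
    end

    # Option 1: grow one DataFrame in place with append! instead of
    # vcat/rbind. append! mutates `result`, so the rows accumulated so far
    # are not copied again at every step the way they are with vcat.
    function accumulate_frames(json_files)
        result = extract_frame(json_files[1])
        for file in json_files[2:end]
            append!(result, extract_frame(file))
        end
        return result
    end

    # Option 2: if the final row count is known (or bounded) up front,
    # preallocate the columns once and fill them in place.
    function accumulate_preallocated(json_files; nmax = 500 * 25_000)
        prealloc = DataFrame(site  = Vector{String}(undef, nmax),
                             value = Vector{Float64}(undef, nmax))
        row = 1
        for file in json_files
            small = extract_frame(file)
            m = nrow(small)
            prealloc.site[row:row+m-1]  = small.site
            prealloc.value[row:row+m-1] = small.value
            row += m
        end
        return prealloc[1:row-1, :]   # drop the unused tail
    end

The gap between the two strategies grows with the number of files: vcat
copies the entire accumulated frame on every iteration (quadratic work
overall), while append! only copies the new rows (roughly linear).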
