Hi,

I'm analyzing a dataset that is relatively large -- basically, I'm reading JSON 
files, extracting whatever information I want into a DataFrame (usually around 
500 rows), and repeating the process 25,000 times.

At the moment, my strategy is to loop through the JSON files, read each one, 
build the relevant DataFrame, and rbind it to the global DataFrame. Since every 
rbind copies the entire accumulated DataFrame, each new dataset takes longer 
and longer to add: it's instantaneous at the beginning, but each rbind takes up 
to a few seconds by the end.
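
For concreteness, here is a rough sketch of what my loop looks like (the 
directory, the field name, and the extraction step are just placeholders, and 
I'm assuming each file parses to an array of records):

    using JSON, DataFrames

    function build_global(dir)
        global_df = DataFrame(id = Int[])
        for fname in readdir(dir)
            parsed = JSON.parsefile(joinpath(dir, fname))          # one JSON file
            small  = DataFrame(id = [rec["id"] for rec in parsed]) # ~500-row extract
            # rbind/vcat copies the whole accumulated DataFrame each time,
            # which is why this step gets slower as the loop goes on.
            global_df = vcat(global_df, small)
        end
        return global_df
    end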

I am about to try the streaming data analysis approach 
(http://juliastats.github.io/DataFrames.jl/datastreams.html) -- basically 
writing each small DataFrame to disk, and having a function that returns the 
rows one after the other. But I'm really curious about how people have managed 
similar problems before -- the "final" dataset is likely to be 10x larger, so 
I'm going to need every improvement I can get.
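
To clarify what I mean, here's a rough sketch of that idea using plain CSV 
files via CSV.jl rather than the DataStreams interface from the linked page 
(the path and helper names are made up):

    using CSV, DataFrames

    # Append each small DataFrame to a single file on disk as it is built,
    # instead of keeping the growing DataFrame in memory.
    function flush_chunk(path, small::DataFrame)
        # Write the header only the first time, then append.
        CSV.write(path, small; append = isfile(path))
    end

    # Later, walk over the stored rows one at a time without loading the
    # whole dataset into memory.
    function process_rows(path)
        for row in CSV.Rows(path)
            # do something with `row` here
        end
    end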

Thanks!
t