The DataStream functionality was moved out of the main DataFrames codebase because it isn't sufficiently robust yet. We need to update the docs to reflect that.
Have you tried append! on DataFrame objects? It should resolve some of the
problems you're facing with vcat/rbind, which copy more and more data at each
step. Also, if you know the length of the final output, you can preallocate
memory and get a lot of speedups. There are still a bunch of places where
you'll end up doing slow memory allocation, but this will get you a good
chunk of the way there. (Most of the memory costs will be from I/O.) A
minimal sketch of both approaches follows the quoted message below.

 -- John

On Jul 28, 2014, at 3:33 PM, Timothée Poisot <[email protected]> wrote:

> Hi,
>
> I'm analyzing a dataset that is relatively large -- basically, I'm reading
> JSON files, extracting whatever information I want into a DataFrame
> (usually around 500 rows), and repeating the process 25000 times.
>
> At the moment, my strategy is to loop through the JSON files, read them,
> create the relevant DataFrame, and rbind it to the global DataFrame.
> Obviously this results in each new dataset taking longer and longer to be
> added. It's instantaneous at the beginning, but each rbind operation takes
> up to a few seconds by the end.
>
> I am about to try streaming data analysis
> (http://juliastats.github.io/DataFrames.jl/datastreams.html) -- basically
> writing each small DataFrame to disk, and having a function that returns
> the rows one after the other. But I'm really curious about how people have
> managed similar problems before -- the "final" dataset is likely to be 10x
> larger, so I'm going to need every improvement I can get.
>
> Thanks!
> t
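A minimal sketch of both suggestions, assuming each JSON file parses to an
array of objects. The extract_frame helper and the column names ("site",
"value") are illustrative stand-ins, not from the original post, and the
syntax follows current DataFrames.jl rather than the 2014 API:

    using DataFrames
    using JSON

    # Hypothetical extraction step: parse one JSON file into a small
    # DataFrame. The field names here are illustrative assumptions.
    function extract_frame(path)
        records = JSON.parsefile(path)   # assumes an array of objects per file
        return DataFrame(site  = String[r["site"] for r in records],
                         value = Float64[r["value"] for r in records])
    end

    # Option 1: grow one DataFrame in place with append! instead of
    # vcat/rbind. append! mutates `result`, so the rows accumulated so far
    # are not copied again at every step the way they are with vcat.
    function accumulate_frames(json_files)
        result = extract_frame(json_files[1])
        for file in json_files[2:end]
            append!(result, extract_frame(file))
        end
        return result
    end

    # Option 2: if the final row count is known (or bounded) up front,
    # preallocate the columns once and fill them in place.
    function accumulate_preallocated(json_files; nmax = 500 * 25_000)
        prealloc = DataFrame(site  = Vector{String}(undef, nmax),
                             value = Vector{Float64}(undef, nmax))
        row = 1
        for file in json_files
            small = extract_frame(file)
            m = nrow(small)
            prealloc.site[row:row+m-1]  = small.site
            prealloc.value[row:row+m-1] = small.value
            row += m
        end
        return prealloc[1:row-1, :]   # drop the unused tail
    end

The gap between the two strategies grows with the number of files: vcat
copies the entire accumulated frame on every iteration (quadratic work
overall), while append! only copies the new rows (roughly linear).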
