In the long run, we will probably need a common infrastructure for data streams: not only for large data frames, but also for streams of images, column vectors, or any other objects, coming from any source (a single huge file, multiple smaller files, the network, ...).

Dahua

On Monday, July 28, 2014 5:46:58 PM UTC-5, John Myles White wrote:
>
> The DataStream functionality was moved out of the main DataFrames code because it's not sufficiently robust. We need to update the docs.
>
> Have you tried append! on DataFrame objects? It should resolve some of the problems you're facing when using vcat/rbind caused by having to do more and more copying at each step.
>
> Also, if you know the length of the final output, you can preallocate memory and get a lot of speedups. There's still a bunch of places where you'll end up doing slow memory allocation, but it'll get you a good chunk of the way there. (Most of the memory costs will be from I/O.)
>
> -- John
>
> On Jul 28, 2014, at 3:33 PM, Timothée Poisot <[email protected]> wrote:
>
> > Hi,
> >
> > I'm analyzing a dataset that is relatively large -- basically, I'm reading JSON files, extracting whatever information I want into a DataFrame (usually around 500 lines), and repeating the process 25000 times.
> >
> > At the moment, my strategy is to loop through the JSON files, read them, create the relevant DataFrame, and rbind it to the global DataFrame. Obviously this results in each new dataset taking longer and longer to be added. It's instantaneous at the beginning, but each rbind operation takes up to a few seconds by the end.
> >
> > I am about to try using streaming data analysis (http://juliastats.github.io/DataFrames.jl/datastreams.html) -- basically writing each small DataFrame to disk, and having a function that returns the rows one after the other. But I'm really curious about how people managed similar problems before -- the "final" dataset is likely to be 10x larger, so I'm going to need all the improvements I can.
> >
> > Thanks!
> > t
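
For reference, here is a minimal sketch of the two suggestions quoted above (append! instead of repeated vcat/rbind, and preallocating when the final length is known). It assumes a recent DataFrames.jl, which postdates this thread, so details may differ from the 2014 API. The helper `extract_chunk` and the column names `file_id` and `value` are hypothetical stand-ins for "read one JSON file and extract a small DataFrame"; they are not part of any library.

```julia
using DataFrames

# Hypothetical stand-in for "read one JSON file and extract a small table".
# A real version would parse the file; here the rows are made up so the sketch runs.
function extract_chunk(i::Int; nrows::Int = 5)
    return DataFrame(file_id = fill(i, nrows), value = collect(1:nrows) .* i)
end

# Slow pattern: vcat builds a brand-new DataFrame on every iteration,
# so the amount of copying grows with the size of the accumulated result.
function collect_with_vcat(nfiles::Int)
    result = extract_chunk(1)
    for i in 2:nfiles
        result = vcat(result, extract_chunk(i))
    end
    return result
end

# Faster pattern: append! grows the existing columns in place.
function collect_with_append(nfiles::Int)
    result = extract_chunk(1)
    for i in 2:nfiles
        append!(result, extract_chunk(i))
    end
    return result
end

# If the final number of rows is known, allocate the columns once
# and copy each chunk into its own slice of rows.
function collect_preallocated(nfiles::Int; nrows::Int = 5)
    total = nfiles * nrows
    result = DataFrame(file_id = Vector{Int}(undef, total),
                       value = Vector{Int}(undef, total))
    for i in 1:nfiles
        chunk = extract_chunk(i; nrows = nrows)
        rows = ((i - 1) * nrows + 1):(i * nrows)
        result.file_id[rows] = chunk.file_id
        result.value[rows] = chunk.value
    end
    return result
end
```

All three functions should produce the same table; only the amount of copying differs. The preallocated version only applies when every chunk has a known, fixed number of rows, which may not hold for the JSON files described in the thread.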
