In the long run, we will probably need a common infrastructure for data 
streams -- not only for large data frames, but also for streams of 
images/column vectors/whatever objects ... from any source (a single huge 
file, multiple smaller files, the network, ...)

Dahua


On Monday, July 28, 2014 5:46:58 PM UTC-5, John Myles White wrote:
>
> The DataStream functionality was moved out of the main DataFrames code 
> because it's not sufficiently robust. We need to update the docs. 
>
> Have you tried append! on DataFrame objects? It should resolve some of the 
> problems you're seeing with vcat/rbind, which do more and more copying at 
> each step. 
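>
> For example, something like this (an untested sketch; the column names and 
> types are placeholders for whatever your JSON extraction produces): 
>
>     using DataFrames
>
>     global_df = DataFrame(a = Int[], b = Float64[])  # empty frame, known schema
>
>     for i in 1:25_000                     # one iteration per JSON file
>         chunk = DataFrame(a = fill(i, 500), b = rand(500))
>         append!(global_df, chunk)         # grows in place; no full recopy per step
>     end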
>
> Also, if you know the length of the final output, you can preallocate 
> memory and get a lot of speedups. There's still a bunch of places where 
> you'll end up doing slow memory allocation, but it'll get you a good chunk 
> of the way there. (Most of the memory costs will be from I/O.) 
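>
> A sketch of the preallocation version (also untested; it assumes a fixed 500 
> rows per file purely for illustration): 
>
>     using DataFrames
>
>     nfiles, nrows = 25_000, 500
>     df = DataFrame(a = Vector{Int}(undef, nfiles * nrows),
>                    b = Vector{Float64}(undef, nfiles * nrows))
>
>     for i in 1:nfiles
>         rng = (i - 1) * nrows + 1 : i * nrows
>         df.a[rng] .= i               # stand-in for values parsed from file i
>         df.b[rng] .= rand(nrows)
>     end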
>
>  -- John 
>
> On Jul 28, 2014, at 3:33 PM, Timothée Poisot <[email protected]> wrote: 
>
> > Hi, 
> > 
> > I'm analyzing a dataset that is relatively large -- basically, I'm 
> > reading JSON files, extracting whatever information I want into a DataFrame 
> > (usually around 500 lines), and repeating the process 25000 times. 
> > 
> > At the moment, my strategy is to loop through the JSON files, read them, 
> > create the relevant DataFrame, and rbind it to the global DataFrame. 
> > Obviously this results in each new dataset taking longer and longer to be 
> > added. It's instantaneous at the beginning, but each rbind operation takes 
> > up to a few seconds by the end. 
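> > 
> > In code, the loop is roughly this (simplified; filenames and extract() stand 
> > in for my actual file list and extraction code): 
> > 
> >     using DataFrames, JSON
> > 
> >     function build(filenames)
> >         global_df = DataFrame()
> >         for f in filenames                      # ~25000 JSON files
> >             chunk = extract(JSON.parsefile(f))  # ~500-row DataFrame per file
> >             global_df = vcat(global_df, chunk)  # recopies all accumulated rows
> >         end
> >         return global_df
> >     end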
> > 
> > I am about to try using streaming data analysis 
> > (http://juliastats.github.io/DataFrames.jl/datastreams.html) -- basically 
> > writing each small DataFrame to disk, and having a function that returns 
> > the rows one after the other. But I'm really curious about how people 
> > managed similar problems before -- the "final" dataset is likely to be 10x 
> > larger, so I'm going to need all the improvements I can get. 
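> > 
> > What I have in mind is roughly this (sketch only; CSV.jl's CSV.write and 
> > CSV.Rows stand in for the removed DataStream API, and chunks is a 
> > placeholder for the per-file DataFrames): 
> > 
> >     using DataFrames, CSV
> > 
> >     # first pass: dump each small DataFrame to disk as it is produced
> >     for (i, chunk) in enumerate(chunks)
> >         CSV.write("chunk_$i.csv", chunk)
> >     end
> > 
> >     # later passes: stream the rows back one at a time
> >     for i in 1:length(chunks)
> >         for row in CSV.Rows("chunk_$i.csv")   # lazy row iterator
> >             # ... process row here ...
> >         end
> >     end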
> > 
> > Thanks! 
> > t 
>
>
