Hi, I'm analyzing a dataset that is relatively large: I read JSON files, extract the information I need into a DataFrame (usually around 500 rows), and repeat the process about 25,000 times.

At the moment, my strategy is to loop over the JSON files, read each one, build the corresponding small DataFrame, and rbind it onto the global DataFrame (roughly the pattern in the first sketch below). Obviously, each new dataset takes longer and longer to add: it's instantaneous at the beginning, but by the end a single rbind takes up to a few seconds.

I am about to try streaming data analysis (http://juliastats.github.io/DataFrames.jl/datastreams.html), basically writing each small DataFrame to disk and having a function that returns the rows one after the other (second sketch below). But I'm really curious how people have handled similar problems before: the "final" dataset is likely to be 10x larger, so I'm going to need every improvement I can get.
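For reference, here is roughly what the current loop looks like. This is a simplified sketch: `json_paths` and `extract_rows` are placeholders for my actual file list and extraction code.

```julia
using DataFrames, JSON

# Simplified sketch of the current approach: extract_rows stands in for the
# real extraction logic and returns a ~500-row DataFrame per JSON file.
function build_dataset(json_paths)
    # seed with the first file so every later step is a plain row-wise vcat
    results = extract_rows(JSON.parsefile(first(json_paths)))
    for path in json_paths[2:end]
        df = extract_rows(JSON.parsefile(path))
        # the "rbind" step: vcat allocates a brand-new DataFrame holding every
        # row accumulated so far, which is why each iteration is slower than the last
        results = vcat(results, df)
    end
    return results
end
```

And this is the general shape of the streaming plan I'm considering. It is not the DataStreams API from the link above, just a sketch of the same idea using CSV.jl as the on-disk format (the `outdir` layout and file names are made up):

```julia
using CSV, DataFrames, JSON

# Step 1: process each JSON file independently and write its small DataFrame
# to its own CSV part file, so the full dataset never has to live in memory.
# extract_rows is the same placeholder as above.
function dump_parts(json_paths, outdir)
    for (i, path) in enumerate(json_paths)
        CSV.write(joinpath(outdir, "part_$i.csv"), extract_rows(JSON.parsefile(path)))
    end
end

# Step 2: lazily iterate over the rows, one at a time, across all part files.
stream_rows(outdir) = Iterators.flatten(
    CSV.Rows(joinpath(outdir, f)) for f in readdir(outdir)
)
```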
Thanks!