If you're using Julia 0.3, you might want to try the current master and/or the "ob/gctune" branch.
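As a stopgap on 0.3, disabling the garbage collector for the duration of the load sometimes helps when GC accounts for most of the time, as in the timings below. This is only a sketch, assuming the machine has enough free RAM to hold everything allocated during the load; `gc_disable`/`gc_enable` are the 0.3-era names for these functions, and "bids.csv" is the file from the original report:

```julia
using DataFrames

gc_disable()                 # suspend garbage collection during the load
df = readtable("bids.csv")   # the Kaggle file from the report below
gc_enable()                  # resume garbage collection
gc()                         # collect everything accumulated during the load
```

This trades memory for time, so watch the process's memory usage; if it swaps, the load will get slower, not faster.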
https://github.com/JuliaLang/julia/issues/10428

Best,
--Tim

On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote:
> Facebook's Kaggle competition has a dataset with ~7.6e6 rows and 9 columns
> (mostly strings):
> https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>
> Loading the dataset in R using read.csv takes 5 minutes, and the resulting
> data frame takes 0.6 GB (RStudio takes a total of 1.6 GB of memory on my
> machine):
>
> > t0 = proc.time(); a = read.csv("bids.csv"); proc.time() - t0
>    user  system elapsed
> 332.295   4.154 343.332
> > object.size(a)
> 601496056 bytes  # (0.6 GB)
>
> Loading the same dataset using DataFrames' readtable takes about 30 minutes
> on the same machine (it varies a bit; the lowest is 25 minutes), and the
> Julia process (a REPL in a terminal) takes 6 GB of memory on the same
> machine.
>
> (I added a couple of calls to the @time macro inside the readtable function
> to see what's taking the time -- the output of those calls is below.)
>
> julia> @time DataFrames.readtable("bids.csv");
> WARNING: Begin readnrows call
> elapsed time: 29.517358476 seconds (2315258744 bytes allocated, 0.35% gc time)
> WARNING: End readnrows call
> WARNING: Begin builddf call
> elapsed time: 1809.506275842 seconds (18509704816 bytes allocated, 85.54% gc time)
> WARNING: End builddf call
> elapsed time: 1840.471467982 seconds (21808681500 bytes allocated, 84.12% gc time)  # total time for loading
>
> Can you please suggest how I can improve load time and memory usage in
> DataFrames for datasets this size and bigger?
>
> Thank you!
