If you're using Julia 0.3, you might want to try the current master and/or
possibly the "ob/gctune" branch.

https://github.com/JuliaLang/julia/issues/10428
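
In case it's useful, a sketch of trying that branch (standard from-source
build steps; the branch name is the one mentioned above):

```shell
# Clone Julia and build the ob/gctune branch from source.
git clone https://github.com/JuliaLang/julia.git
cd julia
git checkout ob/gctune
make
```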

Best,
--Tim

On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote:
> Facebook's Kaggle competition has a dataset with ~7.6e6 rows with 9 columns
> (mostly
> strings). https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
> 
> Loading the dataset in R using read.csv takes 5 minutes and the resulting
> dataframe takes 0.6GB (RStudio takes a total of 1.6GB memory on my machine)
> 
> > t0 = proc.time(); a = read.csv("bids.csv"); proc.time() - t0
> 
>    user  system elapsed
> 332.295   4.154 343.332
> 
> > object.size(a)
> 
> 601496056 bytes  # (~0.6 GB)
> 
> Loading the same dataset using DataFrames' readtable takes about 30 minutes
> on the same machine (it varies a bit; the lowest I've seen is 25 minutes),
> and the Julia process (REPL in a terminal) takes 6GB of memory on the same
> machine.
> 
> (I added a couple of calls to the @time macro inside the readtable function
> to see what's taking the time; the output of these calls is below.)
> 
> julia> @time DataFrames.readtable("bids.csv");
> WARNING: Begin readnrows call
> elapsed time: 29.517358476 seconds (2315258744 bytes allocated, 0.35% gc time)
> WARNING: End readnrows call
> WARNING: Begin builddf call
> elapsed time: 1809.506275842 seconds (18509704816 bytes allocated, 85.54% gc time)
> WARNING: End builddf call
> elapsed time: 1840.471467982 seconds (21808681500 bytes allocated, 84.12% gc time)  # total time for loading
> 
> 
> Can you please suggest how I can improve load time and memory usage in
> DataFrames for datasets this size and larger?
> 
> Thank you!
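
Since ~85% of the builddf time above is spent in GC, one workaround that
sometimes helps on 0.3 is suspending the collector around the load, at the
cost of higher peak memory. A minimal sketch, assuming Julia 0.3's
gc_disable()/gc_enable() and a local bids.csv:

```julia
using DataFrames

# Hedged sketch: turn off the garbage collector while parsing, then
# re-enable it and run one full collection. Peak memory will be higher,
# so this only helps if the machine has enough headroom.
gc_disable()
df = readtable("bids.csv")
gc_enable()
gc()
```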
