Facebook's Kaggle competition has a dataset with ~7.6e6 rows and 9 columns 
(mostly 
strings): https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data

Loading the dataset in R using read.csv takes 5 minutes, and the resulting 
dataframe takes 0.6 GB (RStudio uses a total of 1.6 GB of memory on my machine):

> t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
user   system elapsed 
332.295   4.154 343.332 
> object.size(a)
601496056 bytes #(0.6 GB)

Loading the same dataset using DataFrames' readtable takes about 30 minutes 
on the same machine (it varies a bit; the lowest I've seen is 25 minutes), and 
the Julia process (REPL in Terminal) uses 6 GB of memory on the same machine.

(I added a couple of calls to the @time macro inside the readtable function 
to see what's taking the time; the output of those calls is included below.)

julia> @time DataFrames.readtable("bids.csv");
WARNING: Begin readnrows call
elapsed time: 29.517358476 seconds (2315258744 bytes allocated, 0.35% gc 
time)
WARNING: End readnrows call
WARNING: Begin builddf call
elapsed time: 1809.506275842 seconds (18509704816 bytes allocated, 85.54% 
gc time)
WARNING: End builddf call
elapsed time: 1840.471467982 seconds (21808681500 bytes allocated, 84.12% 
gc time) #total time for loading
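For anyone who wants to try the timing methodology without downloading the full 
dataset first, here is a minimal, self-contained sketch (it assumes only that the 
DataFrames package is installed; the file name tiny.csv and its columns are made 
up for illustration, whereas the real bids.csv is ~7.6e6 rows by 9 columns):

```julia
using DataFrames

# Write a small synthetic CSV so the example runs on its own;
# swap in "bids.csv" to reproduce the numbers above.
open("tiny.csv", "w") do io
    println(io, "bid_id,bidder_id,auction")
    for i in 1:1000
        println(io, "$i,user$i,auc$(i % 10)")
    end
end

# @time reports elapsed time, bytes allocated, and %GC time --
# the same figures shown in the transcript above.
@time df = readtable("tiny.csv")
size(df)  # (1000, 3)
```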


Can you please suggest how I can improve the load time and memory usage in 
DataFrames for datasets this size and bigger?

Thank you!
