Facebook's Kaggle recruiting competition has a dataset with ~7.6e6 rows and 9 columns (mostly strings): https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
Loading the dataset in R using read.csv takes a little over 5 minutes, and the resulting data frame takes 0.6 GB (RStudio uses a total of 1.6 GB of memory on my machine):
> t0 = proc.time(); a = read.csv("bids.csv"); proc.time() - t0
   user  system elapsed
332.295   4.154 343.332
> object.size(a)
601496056 bytes  # ~0.6 GB
Loading the same dataset using DataFrames' readtable takes about 30 minutes on the same machine (this varies a bit; the lowest I've seen is 25 minutes), and the resulting Julia process (REPL in a terminal) takes 6 GB of memory. (I added a couple of calls to the @time macro inside the readtable function to see what's taking the time; a sketch of the instrumentation pattern follows, and its actual output is below.)
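For context, the instrumentation is just warn/@time pairs around readtable's two internal phases, readnrows and builddf. A minimal, self-contained sketch of the pattern (timed_phase is a made-up helper and the workload is a stand-in, not readtable's actual internals):

# Wrap a phase in warn()/@time so its elapsed time, allocations,
# and GC share show up in the log, as in the output below.
function timed_phase(name, f)
    warn("Begin $name call")
    result = @time f()
    warn("End $name call")
    return result
end

# Stand-in workload; inside readtable the two timed phases are
# the readnrows and builddf calls.
timed_phase("readnrows", () -> sum(rand(10^7)))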
julia> @time DataFrames.readtable("bids.csv");
WARNING: Begin readnrows call
elapsed time: 29.517358476 seconds (2315258744 bytes allocated, 0.35% gc time)
WARNING: End readnrows call
WARNING: Begin builddf call
elapsed time: 1809.506275842 seconds (18509704816 bytes allocated, 85.54% gc time)
WARNING: End builddf call
elapsed time: 1840.471467982 seconds (21808681500 bytes allocated, 84.12% gc time)  # total time for the load
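Since builddf spends ~85% of its time in GC, two workarounds come to mind; neither is tested at this scale, the eltypes keyword is my reading of the readtable docstring, and the column types below are guesses for bids.csv:

using DataFrames

# Guessed types for the 9 columns of bids.csv; passing them via the
# eltypes keyword should let readtable skip per-column type inference.
coltypes = [Int64, UTF8String, UTF8String, UTF8String, UTF8String,
            Int64, UTF8String, UTF8String, UTF8String]

gc_disable()   # blunt fix for the ~85% gc time; assumes enough free RAM
df = readtable("bids.csv", eltypes = coltypes)
gc_enable()
gc()           # run one full collection once the DataFrame is built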
Can you please suggest how I can improve load time and memory usage in DataFrames for datasets this size and larger?
Thank you!