Not ideal, but for now you can try disabling garbage collection while
reading in the DataFrame.

gc_disable()
df = DataFrames.readtable("bids.csv")
gc_enable()
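One caveat: if readtable throws partway through, the GC stays disabled for the rest of the session. A safer variant of the same idea (same calls, just wrapped in try/finally so gc_enable() always runs) would be:

```julia
# Disable the GC only for the duration of the load; the finally
# block guarantees it is re-enabled even if readtable errors out.
gc_disable()
try
    df = DataFrames.readtable("bids.csv")
finally
    gc_enable()
end
```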


Thanks,

Jiahao Chen
Research Scientist
MIT CSAIL

On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <[email protected]> wrote:

> If you're using julia 0.3, you might want to try current master and/or
> possibly the "ob/gctune" branch.
>
> https://github.com/JuliaLang/julia/issues/10428
>
> Best,
> --Tim
>
> On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote:
> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows and 9
> > columns (mostly strings).
> > https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
> >
> > Loading the dataset in R using read.csv takes 5 minutes and the
> > resulting dataframe takes 0.6GB (RStudio takes a total of 1.6GB of
> > memory on my machine).
> >
> > >t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
> >
> > user   system elapsed
> > 332.295   4.154 343.332
> >
> > > object.size(a)
> >
> > 601496056 bytes #(0.6 GB)
> >
> > Loading the same dataset using DataFrames' readtable takes about 30
> > minutes on the same machine (it varies a bit; the lowest is 25 minutes),
> > and the resulting Julia process (REPL in Terminal) takes 6GB of memory
> > on the same machine.
> >
> > (I added a couple of calls to the @time macro inside the readtable
> > function to see what's taking time - the outcomes of those calls are
> > also below)
> >
> > julia> @time DataFrames.readtable("bids.csv");
> > WARNING: Begin readnrows call
> > elapsed time: 29.517358476 seconds (2315258744 bytes allocated, 0.35% gc
> > time)
> > WARNING: End readnrows call
> > WARNING: Begin builddf call
> > elapsed time: 1809.506275842 seconds (18509704816 bytes allocated, 85.54%
> > gc time)
> > WARNING: End builddf call
> > elapsed time: 1840.471467982 seconds (21808681500 bytes allocated, 84.12%
> > gc time) #total time for loading
> >
> >
> > Can you please suggest how I can improve load time and memory usage in
> > DataFrames for sizes this big and bigger?
> >
> > Thank you!
>
>