Thank you Tim and Jiahao for your responses. Sorry, I did not mention in my
OP that I was using Version 0.3.10-pre+1 (2015-05-30 11:26 UTC) Commit
80dd75c* (1 day old release-0.3).
I tried other releases as Tim suggested:
On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648* (0 days
old master), the same command takes 14 minutes: half of what it took with
release-0.3, but still about 3 times longer than R's read.csv (5 min).
More importantly, the Julia process takes up 8GB of memory (the R session
takes 1.6GB).
The output of the command `@time DataFrames.readtable("bids.csv");` is:
857.120 seconds (352 M allocations: 16601 MB, 71.59% gc time)  # gc time
reduced from 85% to 71%
For completeness, on Version 0.4.0-dev+4451 (2015-04-22 21:55 UTC)
ob/gctune/238ed08* (fork: 1 commits, 39 days), the command `@time
DataFrames.readtable("bids.csv");` takes 21 minutes; the output of the
macro is:
elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58% gc time
in 33 pauses with 31 full sweep)
The process also takes up 8GB of memory on the machine, more than the 6GB
it took on release-0.3. My machine has also slowed down significantly, so
perhaps the increase in memory compared to release-0.3 is what matters.
On disabling gc, my machine (a 4GB laptop) goes soul searching, so it's
not an option for now.
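For anyone with more RAM who wants to try Jiahao's suggestion, a slightly
safer variant wraps the read in try/finally so that gc is re-enabled even
if readtable throws; a minimal sketch using the same
gc_disable()/gc_enable() calls from his reply:

using DataFrames

gc_disable()             # stop the collector for the duration of the read
df = try
    readtable("bids.csv")
finally
    gc_enable()          # always restore gc, even if readtable errors
end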
Is this the best one can expect for now? I read the discussion on issue
#10428 but I did not understand it well :-(
Thank you!
On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote:
>
> Not ideal, but for now you can try turning off the garbage collection
> while reading in the DataFrame.
>
> gc_disable()
> df = DataFrames.readtable("bids.csv")
> gc_enable()
>
>
> Thanks,
>
> Jiahao Chen
> Research Scientist
> MIT CSAIL
>
> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <[email protected]> wrote:
>
>> If you're using julia 0.3, you might want to try current master and/or
>> possibly the "ob/gctune" branch.
>>
>> https://github.com/JuliaLang/julia/issues/10428
>>
>> Best,
>> --Tim
>>
>> On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote:
>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows with 9
>> > columns (mostly strings):
>> > https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>> >
>> > Loading the dataset in R using read.csv takes 5 minutes and the
>> > resulting dataframe takes 0.6GB (RStudio takes a total of 1.6GB memory
>> > on my machine)
>> >
>> > >t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
>> >
>> > user system elapsed
>> > 332.295 4.154 343.332
>> >
>> > > object.size(a)
>> >
>> > 601496056 bytes #(0.6 GB)
>> >
>> > Loading the same dataset using DataFrames' readtable takes about 30
>> > minutes on the same machine (varies a bit, lowest is 25 minutes), and
>> > the resulting Julia process (REPL on Terminal) takes 6GB memory on the
>> > same machine.
>> >
>> > (I added a couple of calls to the @time macro inside the readtable
>> > function to see what's taking time; the outcomes of these calls are
>> > below.)
>> >
>> > julia> @time DataFrames.readtable("bids.csv");
>> > WARNING: Begin readnrows call
>> > elapsed time: 29.517358476 seconds (2315258744 bytes allocated, 0.35% gc time)
>> > WARNING: End readnrows call
>> > WARNING: Begin builddf call
>> > elapsed time: 1809.506275842 seconds (18509704816 bytes allocated, 85.54% gc time)
>> > WARNING: End builddf call
>> > elapsed time: 1840.471467982 seconds (21808681500 bytes allocated, 84.12% gc time)  # total time for loading
>> >
>> >
>> > Can you please suggest how I can improve load time and memory usage in
>> > DataFrames for sizes this big and bigger?
>> >
>> > Thank you!
>>
>>
>