I've been meaning to clean some things up and properly release this
functionality, but in the meantime: I have a new way to read in CSV files
that beats anything else I know of. To get the functionality, you'll need
to be running 0.4 master, then do

Pkg.add("SQLite")
Pkg.checkout("SQLite","jq/updates")
Pkg.clone("https://github.com/quinnj/CSV.jl";)
Pkg.clone("https://github.com/quinnj/Mmap.jl";)

I then ran the following on the bids.csv file

using SQLite, CSV

db = SQLite.SQLiteDB()                                 # open an SQLite database handle

ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv")  # wrap the CSV file for reading

@time lines = SQLite.create(db, ff, "temp2")           # load the file into table "temp2"

It took 18 seconds on my newish MBP. The fastest other CSV reader I know
of is `fread` from the R data.table package, and it took 34 seconds on my
machine. I'm actually pretty surprised by that, since in other tests I've
done it was on par with SQLite+CSV or sometimes slightly faster.

Now, you're not necessarily getting a Julia structure in this case;
instead the data is loaded into an SQLite table, against which you can
then run SQLite.query(db, sql_string) to do manipulations and such.
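For example, here is a minimal sketch of that query step. The table name
"temp2" matches the SQLite.create call above; the bidder_id column is an
assumption based on the Kaggle bids dataset, so substitute your own column
names:

# count the rows that landed in the "temp2" table
SQLite.query(db, "SELECT COUNT(*) FROM temp2")

# aggregate inside SQLite before bringing results back into Julia
# (bidder_id is a hypothetical column name; adjust to your schema)
SQLite.query(db, "SELECT bidder_id, COUNT(*) AS cnt FROM temp2
                  GROUP BY bidder_id ORDER BY cnt DESC LIMIT 10")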

-Jacob


On Sun, May 31, 2015 at 9:42 PM, <[email protected]> wrote:

> Thank you Tim and Jiahao for your responses. Sorry, I did not mention in
> my OP that I was using Version 0.3.10-pre+1 (2015-05-30 11:26 UTC) Commit
> 80dd75c* (1 day old release-0.3).
>
> I tried other releases as Tim suggested:
>
> On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648* (0 days
> old master), the same command takes 14 minutes - half of what it took
> with release-0.3, but still 3 times longer than R's read.csv (5 min).
> More importantly, the Julia process takes up 8GB of memory (the R session
> takes 1.6GB). The output of the command `@time
> DataFrames.readtable("bids.csv");` is:
> 857.120 seconds      (352 M allocations: 16601 MB, 71.59% gc time) # gc
> time reduced from 85% to 71%
>
> For completeness: on Version 0.4.0-dev+4451 (2015-04-22 21:55 UTC)
> ob/gctune/238ed08* (fork: 1 commits, 39 days), the command `@time
> DataFrames.readtable("bids.csv");` takes 21 minutes; the output of the
> macro is:
> elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58% gc time
> in 33 pauses with 31 full sweep)
> The process also takes up 8GB of memory on the machine, more than the
> earlier build did. My machine has also slowed down significantly - so
> perhaps the increase in memory compared to release-0.3 is significant.
>
> On disabling gc, my machine (a 4GB laptop) goes soul searching, so it's
> not an option for now.
>
> Is this the best one can expect for now? I read the discussion on issue
> #10428 but I did not understand it well :-(
>
> Thank you!
>
>
>
> On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote:
>>
>> Not ideal, but for now you can try turning off the garbage collection
>> while reading in the DataFrame.
>>
>> gc_disable()
>> df = DataFrames.readtable("bids.csv")
>> gc_enable()
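>>
>> (A minimal sketch of the same idea in try/finally form, so gc is
>> re-enabled even if readtable throws; this assumes the same 0.3/0.4-era
>> gc_disable()/gc_enable() functions used above:)
>>
>> gc_disable()               # pause the garbage collector
>> df = try
>>     DataFrames.readtable("bids.csv")
>> finally
>>     gc_enable()            # always turn gc back on, even on error
>> end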
>>
>>
>> Thanks,
>>
>> Jiahao Chen
>> Research Scientist
>> MIT CSAIL
>>
>> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <[email protected]> wrote:
>>
>>> If you're using julia 0.3, you might want to try current master and/or
>>> possibly the "ob/gctune" branch.
>>>
>>> https://github.com/JuliaLang/julia/issues/10428
>>>
>>> Best,
>>> --Tim
>>>
>>> On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote:
>>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows and 9
>>> > columns (mostly strings):
>>> > https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>>> >
>>> > Loading the dataset in R using read.csv takes 5 minutes and the
>>> > resulting dataframe takes 0.6GB (RStudio takes a total of 1.6GB of
>>> > memory on my machine):
>>> >
>>> > >t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
>>> >
>>> > user   system elapsed
>>> > 332.295   4.154 343.332
>>> >
>>> > > object.size(a)
>>> >
>>> > 601496056 bytes #(0.6 GB)
>>> >
>>> > Loading the same dataset using DataFrames' readtable takes about 30
>>> > minutes on the same machine (it varies a bit; the lowest is 25
>>> > minutes), and the resulting Julia process (REPL in Terminal) takes
>>> > 6GB of memory on the same machine.
>>> >
>>> > (I added a couple of calls to the @time macro inside the readtable
>>> > function to see what's taking time - the outcomes of these calls are
>>> > also below.)
>>> >
>>> > julia> @time DataFrames.readtable("bids.csv");
>>> > WARNING: Begin readnrows call
>>> > elapsed time: 29.517358476 seconds (2315258744 bytes allocated, 0.35% gc time)
>>> > WARNING: End readnrows call
>>> > WARNING: Begin builddf call
>>> > elapsed time: 1809.506275842 seconds (18509704816 bytes allocated, 85.54% gc time)
>>> > WARNING: End builddf call
>>> > elapsed time: 1840.471467982 seconds (21808681500 bytes allocated, 84.12% gc time) # total time for loading
>>> >
>>> >
>>> > Can you please suggest how I can improve load time and memory usage
>>> > in DataFrames for datasets this size and bigger?
>>> >
>>> > Thank you!
>>>
>>>
>>
