The single biggest advantage SQLite has is that you can mmap a file and
just tell SQLite the pointer address where each string starts and how long
it is, all without copying. The huge, huge bottleneck in most
implementations isn't identifying where a string starts and how long it
is, but then allocating in-program memory and copying the string into it.
With SQLite, we can use an in-memory database, mmap the file, and tell
SQLite where each string for a column lives by handing it the starting
pointer address and the length. I've been looking into how to solve this
problem over the last month or so (apart from Oscar's gc wizardry), and it
just occurred to me last week that using SQLite may be the best way; so
far, the results are promising!
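
Roughly, the idea looks like this (just a sketch using the raw sqlite3 C
API via ccall, not what CSV.jl actually does internally; the table name,
column name, offsets, and the "libsqlite3" library name are placeholders,
and the Mmap call follows the later stdlib-style API):

using Mmap   # the Mmap.jl package mentioned below

# Map the CSV into memory: `buf` is a byte vector backed by the file, no copy.
io  = open("bids.csv")
buf = Mmap.mmap(io, Vector{UInt8}, filesize(io))

# Open an in-memory database and prepare a one-column INSERT (error checks
# omitted; on Julia >= 0.7, Ptr{Void} is spelled Ptr{Cvoid}).
db = Ref{Ptr{Void}}(C_NULL)
ccall((:sqlite3_open, "libsqlite3"), Cint, (Cstring, Ptr{Ptr{Void}}),
      ":memory:", db)
ccall((:sqlite3_exec, "libsqlite3"), Cint,
      (Ptr{Void}, Cstring, Ptr{Void}, Ptr{Void}, Ptr{Void}),
      db[], "CREATE TABLE t (col1 TEXT)", C_NULL, C_NULL, C_NULL)
stmt = Ref{Ptr{Void}}(C_NULL)
ccall((:sqlite3_prepare_v2, "libsqlite3"), Cint,
      (Ptr{Void}, Cstring, Cint, Ptr{Ptr{Void}}, Ptr{Ptr{Void}}),
      db[], "INSERT INTO t VALUES (?)", -1, stmt, C_NULL)

# Say the CSV scan found a field starting at byte offset `off` with length
# `len` (placeholder values here). Bind the pointer and length directly; a
# NULL destructor (SQLITE_STATIC) means SQLite reads straight out of the
# mmapped buffer rather than requiring a private copy at bind time.
off, len = 0, 9
ccall((:sqlite3_bind_text, "libsqlite3"), Cint,
      (Ptr{Void}, Cint, Ptr{UInt8}, Cint, Ptr{Void}),
      stmt[], 1, pointer(buf) + off, len, C_NULL)
ccall((:sqlite3_step, "libsqlite3"), Cint, (Ptr{Void},), stmt[])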

-Jacob

On Mon, Jun 1, 2015 at 11:40 AM, <verylucky...@gmail.com> wrote:

> Great, thank you Jacob, I will try it out!
>
> Do you have a writeup on the differences between the way you read CSV
> files and the way it is currently done in Julia? Would love to know more!
>
> Perhaps obvious, but for completeness: reading the data using readcsv or
> readdlm does not improve the metrics I reported by much, suggesting that
> the overhead from DataFrames itself is small.
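>
> For example, one such comparison can be timed with something like this
> (an illustrative call, not a transcript of the exact run):
>
> julia> @time data = readcsv("bids.csv");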
>
> Thank you again!
>
> On Monday, June 1, 2015 at 1:06:50 PM UTC-4, Jacob Quinn wrote:
>>
>> I've been meaning to clean some things up and properly release this
>> functionality, but I have a new way to read in CSV files that beats
>> anything else out there that I know of. To try it, you'll need to be
>> running 0.4 master, then do:
>>
>> Pkg.add("SQLite")
>> Pkg.checkout("SQLite","jq/updates")
>> Pkg.clone("https://github.com/quinnj/CSV.jl";)
>> Pkg.clone("https://github.com/quinnj/Mmap.jl";)
>>
>> I then ran the following on the bids.csv file
>>
>> using SQLite, CSV
>>
>> db = SQLite.SQLiteDB()        # the target SQLite database
>>
>> ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv")
>>
>> @time lines = SQLite.create(db, ff, "temp2")   # load the file into the table "temp2"
>>
>> It took 18 seconds on my newish MBP. The `fread` function from the R
>> data.table package is the fastest other CSV reader I know of, and it took
>> 34 seconds on my machine. I'm actually pretty surprised by that, since in
>> other tests I've done it was on par with the SQLite+CSV approach or
>> sometimes slightly faster.
>>
>> Now, you're not necessarily getting a Julia structure in this case; the
>> data is loaded into an SQLite table, which you can then query with
>> SQLite.query(db, sql_string) to do manipulations and such.
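>>
>> For example, a quick follow-up query on the loaded table might look like
>> this (the column name here is illustrative, not necessarily the actual
>> bids.csv schema):
>>
>> SQLite.query(db, "SELECT bidder_id, COUNT(*) FROM temp2 GROUP BY bidder_id")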
>>
>> -Jacob
>>
>>
>> On Sun, May 31, 2015 at 9:42 PM, <verylu...@gmail.com> wrote:
>>
>>> Thank you Tim and Jiahao for your responses. Sorry, I did not mention in
>>> my OP that I was using Version 0.3.10-pre+1 (2015-05-30 11:26 UTC) Commit
>>> 80dd75c* (1 day old release-0.3).
>>>
>>> I tried other releases as Tim suggested:
>>>
>>> On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648* (0 days
>>> old master), the same command takes 14 minutes - half of what it took with
>>> release-0.3, but still about 3 times longer than R's read.csv (5 min).
>>> More importantly, the Julia process takes up 8GB of memory (the R session
>>> takes 1.6GB). The output of the command `@time
>>> DataFrames.readtable("bids.csv");` is:
>>> 857.120 seconds      (352 M allocations: 16601 MB, 71.59% gc time) # gc
>>> time reduced from 85% to 71%
>>>
>>> For completeness, on Version 0.4.0-dev+4451 (2015-04-22 21:55 UTC)
>>> ob/gctune/238ed08* (fork: 1 commits, 39 days), the command `@time
>>> DataFrames.readtable("bids.csv");` takes 21 minutes; the output of the
>>> macro is:
>>> elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58% gc time
>>> in 33 pauses with 31 full sweep)
>>> The process also takes up 8GB of memory on the machine, more than the
>>> earlier one. My machine has also slowed down significantly - so perhaps
>>> the increase in memory compared to release-0.3 is significant.
>>>
>>> On disabling gc, my machine (a 4GB laptop) goes soul searching, so it's
>>> not an option for now.
>>>
>>> Is this the best one can expect for now? I read the discussion on issue
>>> #10428 but I did not understand it well :-(
>>>
>>> Thank you!
>>>
>>>
>>>
>>> On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote:
>>>>
>>>> Not ideal, but for now you can try turning off the garbage collection
>>>> while reading in the DataFrame.
>>>>
>>>> gc_disable()
>>>> df = DataFrames.readtable("bids.csv")
>>>> gc_enable()
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Jiahao Chen
>>>> Research Scientist
>>>> MIT CSAIL
>>>>
>>>> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <tim....@gmail.com> wrote:
>>>>
>>>>> If you're using julia 0.3, you might want to try current master and/or
>>>>> possibly the "ob/gctune" branch.
>>>>>
>>>>> https://github.com/JuliaLang/julia/issues/10428
>>>>>
>>>>> Best,
>>>>> --Tim
>>>>>
>>>>> On Sunday, May 31, 2015 09:50:03 AM verylu...@gmail.com wrote:
>>>>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows and 9
>>>>> > columns (mostly strings):
>>>>> > https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>>>>> >
>>>>> > Loading the dataset in R using read.csv takes 5 minutes and the
>>>>> > resulting dataframe takes 0.6GB (RStudio takes a total of 1.6GB memory
>>>>> > on my machine).
>>>>> >
>>>>> > >t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
>>>>> >
>>>>> >    user  system elapsed
>>>>> > 332.295   4.154 343.332
>>>>> >
>>>>> > > object.size(a)
>>>>> >
>>>>> > 601496056 bytes #(0.6 GB)
>>>>> >
>>>>> > Loading the same dataset using DataFrames' readtable takes about 30
>>>>> > minutes on the same machine (it varies a bit; the lowest is 25
>>>>> > minutes), and the Julia process (REPL on Terminal) takes 6GB of memory
>>>>> > on the same machine.
>>>>> >
>>>>> > (I added a couple of calls to the @time macro inside the readtable
>>>>> > function to see what's taking time - the outcomes of those calls are
>>>>> > also below.)
>>>>> >
>>>>> > julia> @time DataFrames.readtable("bids.csv");
>>>>> > WARNING: Begin readnrows call
>>>>> > elapsed time: 29.517358476 seconds (2315258744 bytes allocated,
>>>>> > 0.35% gc time)
>>>>> > WARNING: End readnrows call
>>>>> > WARNING: Begin builddf call
>>>>> > elapsed time: 1809.506275842 seconds (18509704816 bytes allocated,
>>>>> > 85.54% gc time)
>>>>> > WARNING: End builddf call
>>>>> > elapsed time: 1840.471467982 seconds (21808681500 bytes allocated,
>>>>> > 84.12% gc time) # total time for loading
>>>>> >
>>>>> >
>>>>> > Can you please suggest how I can improve load time and memory usage
>>>>> > in DataFrames for sizes this big and bigger?
>>>>> >
>>>>> > Thank you!
>>>>>
>>>>>
>>>>
>>
