Pushed some fixes. Thanks for trying it out. -Jacob
On Wed, Oct 7, 2015 at 11:54 PM, bernhard <[email protected]> wrote:
> Thank you, Quinn.
>
> Things do not work (for me) though.
>
> Is it possible you are missing a comma after "col" in lines 24 and 33 of
> Sink.jl?
> function writefield(io::Sink, val::AbstractString, col N)
>
> On Wednesday, October 7, 2015 at 16:36:52 UTC+2, David Gold wrote:
>>
>> Yaas. Very excited to see this.
>>
>> On Wednesday, October 7, 2015 at 6:07:44 AM UTC-7, Jacob Quinn wrote:
>>>
>>> Haha, nice timing. I just pushed a big CSV.jl overhaul for 0.4 yesterday
>>> afternoon. I just pushed the DataStreams.jl package, so you can find that
>>> at https://github.com/quinnj/DataStreams.jl, and you'll have to
>>> Pkg.clone it. Everything should work at that point.
>>>
>>> I'm still cleaning up some other related packages, which is why things
>>> aren't documented/registered/tagged quite yet; the interface may still
>>> evolve slightly, but probably more in the low-level machinery, so
>>> `stream!(::CSV.Source, ::DataStream)` should stay the same.
>>>
>>> I've already got a bit of a writeup started, so if you'd rather wait
>>> another couple of days or a week, I should have something ready by then.
>>>
>>> -Jacob
>>>
>>> On Wed, Oct 7, 2015 at 12:33 AM, bernhard <[email protected]> wrote:
>>>
>>>> Is there any update on this? Or maybe a timeline/roadmap?
>>>> I would love to see a faster CSV reader.
>>>>
>>>> I tried to take a look at Jacob's CSV.jl,
>>>> but I seem to be missing https://github.com/lindahua/DataStreams.jl
>>>> and I have no idea where to find the DataStreams package...
>>>> Does it still exist?
>>>>
>>>> Is there any (experimental) way to make CSV.jl work?
>>>>
>>>>> On Saturday, June 6, 2015 at 14:41:36 UTC+2, David Gold wrote:
>>>>>
>>>>> @Jacob,
>>>>>
>>>>> Thank you very much for your explanation! I expect having such a
>>>>> blueprint will make delving into the actual code more tractable for me.
>>>>> I'll be curious to see how your solution here and your proposal for
>>>>> string handling end up playing with the current Julia data ecosystem.
>>>>>
>>>>> On Saturday, June 6, 2015 at 1:17:34 AM UTC-4, Jacob Quinn wrote:
>>>>>>
>>>>>> @David,
>>>>>>
>>>>>> Sorry for the slow response. It's been a busy week :)
>>>>>>
>>>>>> Here's a quick rundown of the approach:
>>>>>>
>>>>>> - In the still-yet-to-be-officially-published
>>>>>> https://github.com/quinnj/CSV.jl package, the bulk of the code goes
>>>>>> into creating a `CSV.File` type in which the structure/metadata of the
>>>>>> file is parsed/detected/saved (e.g. header, delimiter, newline, # of
>>>>>> columns, detected column types, etc.)
>>>>>> - `SQLite.create` and now `CSV.read` both take a `CSV.File` as input
>>>>>> and follow a similar parsing process:
>>>>>>   - The actual file contents are mmapped, i.e. the entire file is
>>>>>> loaded into memory at once
>>>>>>   - There are currently three `readfield` methods
>>>>>> (Int, Float64, String) that take an open `CSV.Stream` type (which holds
>>>>>> the mmapped data and the current "position" of parsing) and read a
>>>>>> single field according to what the type of that column is supposed to be
>>>>>>   - For example, readfield(io::CSV.Stream, ::Type{Float64}, row, col)
>>>>>> will start reading at the current position of the `CSV.Stream` until it
>>>>>> hits the next delimiter, newline, or end of the file, and then interpret
>>>>>> the contents as a Float64, returning `val, isnull` (see the sketch
>>>>>> below)
>>>>>>
>>>>>> That's pretty much it.
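To make the rundown above concrete, here is a minimal sketch of what such a
readfield method could look like, in the Julia 0.4 syntax of the thread. The
FieldStream type, the hard-coded comma delimiter, and the simplified null
handling are illustrative assumptions, not the actual CSV.jl internals (which
also handle quoting, configurable delimiters, and more):

    # Illustrative stand-in for CSV.Stream: the mmapped bytes plus a
    # current parse position.
    type FieldStream
        data::Vector{UInt8}   # mmapped file contents
        pos::Int              # current parse position
    end

    # Read one Float64 field: scan to the next delimiter/newline/EOF,
    # then interpret the bytes in between. Returns `val, isnull`.
    function readfield(io::FieldStream, ::Type{Float64})
        start = io.pos
        while io.pos <= length(io.data) &&
              io.data[io.pos] != UInt8(',') && io.data[io.pos] != UInt8('\n')
            io.pos += 1
        end
        s = bytestring(io.data[start:io.pos-1])
        io.pos += 1                   # step past the delimiter/newline
        isnull = isempty(s)
        return (isnull ? NaN : parse(Float64, s)), isnull
    end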
>>>>>> One of the most critical performance keys for both SQLite and
>>>>>> CSV.read is not copying strings once the file has been mmapped. For
>>>>>> SQLite, the sqlite3_bind_text library method actually has a flag to
>>>>>> indicate whether the text should be copied or not, so we're able to
>>>>>> pass the pointer to the position in the mmapped array directly. For the
>>>>>> CSV.read method, which returns a Vector of the columns (as typed
>>>>>> arrays), I've actually rolled a quick-and-dirty CString type that looks
>>>>>> like:
>>>>>>
>>>>>> immutable CString
>>>>>>     ptr::Ptr{UInt8}
>>>>>>     len::Int
>>>>>> end
>>>>>>
>>>>>> With a few extra method definitions (sketched below), this type looks
>>>>>> very close to a real string type, but we can construct it by pointing
>>>>>> directly to the mmapped region (which currently isn't possible for
>>>>>> native Julia string types). See https://github.com/quinnj/Strings.jl
>>>>>> for more brainstorming around this alternative string implementation.
>>>>>> You can convert a CString to a Julia string by calling
>>>>>> string(x::CString), or map(string, column) for an Array of CSV.CStrings.
>>>>>>
>>>>>> As an update on the performance on the Facebook Kaggle competition
>>>>>> bids.csv file:
>>>>>>
>>>>>> - readcsv: 45 seconds, 33% gc time
>>>>>> - CSV.read: 19 seconds, 3% gc time
>>>>>> - SQLite.create: 25 seconds, 3.25% gc time
>>>>>>
>>>>>> Anyway, hopefully I'll get around to cleaning up CSV.jl for an
>>>>>> official release, but it's that last 10-20% that's always the hardest
>>>>>> to finish up :)
>>>>>>
>>>>>> -Jacob
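For a sense of the "few extra method definitions" that make a
pointer-plus-length type behave like a string, here is a rough sketch, again
in Julia 0.4 syntax. This is an illustration, not the actual CSV.jl or
Strings.jl code, and it assumes ASCII data for brevity:

    immutable CString
        ptr::Ptr{UInt8}   # points directly into the mmapped region
        len::Int
    end

    Base.length(s::CString) = s.len
    # Byte-indexed access (ASCII assumption; real code must handle UTF-8)
    Base.getindex(s::CString, i::Int) = Char(unsafe_load(s.ptr, i))
    # Copies the bytes into a real Julia string, as string(x::CString) above
    Base.string(s::CString) = bytestring(s.ptr, s.len)
    Base.show(io::IO, s::CString) = print(io, string(s))

Constructing one is just pointer arithmetic over the mmapped buffer, e.g.
CString(pointer(buf, field_start), field_len); no bytes are copied until
string() is called.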
>>>>>>
>>>>>> On Mon, Jun 1, 2015 at 4:25 PM, David Gold <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> @Jacob I'm just developing a working understanding of these issues.
>>>>>>> Would you please help me get a better handle on your solution?
>>>>>>>
>>>>>>> My understanding thus far: reading a (local) .csv file into a
>>>>>>> DataFrame using DataFrames.readtable involves reading the file into an
>>>>>>> IOStream and then parsing that stream into a form amenable to
>>>>>>> DataFrames.builddf, which builds the DataFrame object returned by
>>>>>>> readtable. Getting the contents of the .csv file into memory in a form
>>>>>>> that Julia functions can manipulate is labor-intensive done this way.
>>>>>>> However, with SQLite, the entire file can just be thrown into memory
>>>>>>> wholesale, along with some metadata (maybe not the right term?) that
>>>>>>> delineates the tabular properties of the data.
>>>>>>>
>>>>>>> What I am curious about, then (if this understanding is not too
>>>>>>> misguided), is how SQLite returns, say, a column of data that doesn't
>>>>>>> include, say, a bunch of delimiters. That is, what sort of parsing
>>>>>>> *does* SQLite do, and when?
>>>>>>>
>>>>>>> On Monday, June 1, 2015 at 1:48:16 PM UTC-4, Jacob Quinn wrote:
>>>>>>>>
>>>>>>>> The biggest single advantage SQLite has is the ability to mmap a
>>>>>>>> file and just tell SQLite which pointer addresses start strings and
>>>>>>>> how long they are, all without copying. The huge, huge bottleneck in
>>>>>>>> most implementations is not just identifying where a string starts
>>>>>>>> and how long it is, but then allocating "in program" memory and
>>>>>>>> copying the string into it. With SQLite, we can use an in-memory
>>>>>>>> database, mmap the file, and tell SQLite where each string for a
>>>>>>>> column lives by giving it the starting pointer address and how long
>>>>>>>> it is. I've been looking into how to solve this problem over the last
>>>>>>>> month or so (apart from Oscar's gc wizardry), and it just occurred to
>>>>>>>> me last week that using SQLite may be the best way; so far, the
>>>>>>>> results are promising!
>>>>>>>>
>>>>>>>> -Jacob
>>>>>>>>
>>>>>>>> On Mon, Jun 1, 2015 at 11:40 AM, <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Great, thank you Jacob, I will try it out!
>>>>>>>>>
>>>>>>>>> Do you have a writeup on the differences between the way you read
>>>>>>>>> CSV files and the way it is currently done in Julia? Would love to
>>>>>>>>> know more!
>>>>>>>>>
>>>>>>>>> Obvious perhaps, but for completeness: reading the data using
>>>>>>>>> readcsv or readdlm does not much improve the metrics I reported,
>>>>>>>>> suggesting that the overhead from DataFrames is not large.
>>>>>>>>>
>>>>>>>>> Thank you again!
>>>>>>>>>
>>>>>>>>> On Monday, June 1, 2015 at 1:06:50 PM UTC-4, Jacob Quinn wrote:
>>>>>>>>>>
>>>>>>>>>> I've been meaning to clean some things up and properly release
>>>>>>>>>> the functionality, but I have a new way to read in CSV files that
>>>>>>>>>> beats anything else out there that I know of. To get the
>>>>>>>>>> functionality, you'll need to be running 0.4 master, then do
>>>>>>>>>>
>>>>>>>>>> Pkg.add("SQLite")
>>>>>>>>>> Pkg.checkout("SQLite", "jq/updates")
>>>>>>>>>> Pkg.clone("https://github.com/quinnj/CSV.jl")
>>>>>>>>>> Pkg.clone("https://github.com/quinnj/Mmap.jl")
>>>>>>>>>>
>>>>>>>>>> I then ran the following on the bids.csv file:
>>>>>>>>>>
>>>>>>>>>> using SQLite, CSV
>>>>>>>>>>
>>>>>>>>>> db = SQLite.SQLiteDB()
>>>>>>>>>> ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv")
>>>>>>>>>> @time lines = SQLite.create(db, ff, "temp2")
>>>>>>>>>>
>>>>>>>>>> It took 18 seconds on my newish MBP. From the R data.table
>>>>>>>>>> package, `fread` is the other fastest CSV reader I know of, and it
>>>>>>>>>> took 34 seconds on my machine. I'm actually pretty surprised by
>>>>>>>>>> that, since in other tests I've done it was on par with SQLite+CSV
>>>>>>>>>> or sometimes slightly faster.
>>>>>>>>>>
>>>>>>>>>> Now, you're not necessarily getting a Julia structure in this
>>>>>>>>>> case, but it's loading the data into an SQLite table that you can
>>>>>>>>>> then query with SQLite.query(db, sql_string) to do manipulations
>>>>>>>>>> and such.
>>>>>>>>>>
>>>>>>>>>> -Jacob
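Once the data is loaded, the SQLite.query call mentioned above can do the
manipulation inside SQLite. A small usage sketch: the COUNT query should work
against the "temp2" table created above, while the bidder_id column name is an
assumption about bids.csv, not something stated in the thread:

    res = SQLite.query(db, "SELECT COUNT(*) FROM temp2")

    # A grouped aggregation pushed down to SQLite (bidder_id is assumed):
    res = SQLite.query(db, "SELECT bidder_id, COUNT(*) AS n
                            FROM temp2 GROUP BY bidder_id LIMIT 10")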
>>>>>>>>>>
>>>>>>>>>> On Sun, May 31, 2015 at 9:42 PM, <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you Tim and Jiahao for your responses. Sorry, I did not
>>>>>>>>>>> mention in my OP that I was using Version 0.3.10-pre+1
>>>>>>>>>>> (2015-05-30 11:26 UTC) Commit 80dd75c* (1 day old release-0.3).
>>>>>>>>>>>
>>>>>>>>>>> I tried other releases as Tim suggested:
>>>>>>>>>>>
>>>>>>>>>>> On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648*
>>>>>>>>>>> (0 days old master), the same command takes 14 minutes: half what
>>>>>>>>>>> it was taking with release-0.3, but still 3 times longer than R's
>>>>>>>>>>> read.csv (5 min). More important, the Julia process takes up 8GB
>>>>>>>>>>> of memory (the R session takes 1.6GB). The output of the command
>>>>>>>>>>> `@time DataFrames.readtable("bids.csv");` is
>>>>>>>>>>> 857.120 seconds (352 M allocations: 16601 MB, 71.59% gc time)
>>>>>>>>>>> # gc time reduced from 85% to 71%
>>>>>>>>>>>
>>>>>>>>>>> For completeness, on Version 0.4.0-dev+4451 (2015-04-22 21:55
>>>>>>>>>>> UTC) ob/gctune/238ed08* (fork: 1 commit, 39 days), the command
>>>>>>>>>>> `@time DataFrames.readtable("bids.csv");` takes 21 minutes; the
>>>>>>>>>>> output of the macro is:
>>>>>>>>>>> elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58%
>>>>>>>>>>> gc time in 33 pauses with 31 full sweep)
>>>>>>>>>>> The process also takes up 8GB of memory on the machine, more than
>>>>>>>>>>> the earlier one. My machine has also slowed down significantly,
>>>>>>>>>>> so perhaps the increase in memory compared to release-0.3 is
>>>>>>>>>>> significant.
>>>>>>>>>>>
>>>>>>>>>>> On disabling gc, my machine (4GB laptop) goes soul-searching, so
>>>>>>>>>>> it's not an option for now.
>>>>>>>>>>>
>>>>>>>>>>> Is this the best one can expect for now? I read the discussion
>>>>>>>>>>> on issue #10428 but I did not understand it well :-(
>>>>>>>>>>>
>>>>>>>>>>> Thank you!
>>>>>>>>>>>
>>>>>>>>>>> On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Not ideal, but for now you can try turning off garbage
>>>>>>>>>>>> collection while reading in the DataFrame:
>>>>>>>>>>>>
>>>>>>>>>>>> gc_disable()
>>>>>>>>>>>> df = DataFrames.readtable("bids.csv")
>>>>>>>>>>>> gc_enable()
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Jiahao Chen
>>>>>>>>>>>> Research Scientist
>>>>>>>>>>>> MIT CSAIL
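A slightly more defensive variant of the same workaround (a suggestion added
here, not from the thread) re-enables gc even if readtable throws:

    gc_disable()
    df = try
        DataFrames.readtable("bids.csv")
    finally
        gc_enable()   # runs even if readtable errors
    end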
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If you're using julia 0.3, you might want to try current
>>>>>>>>>>>>> master and/or possibly the "ob/gctune" branch.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/JuliaLang/julia/issues/10428
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> --Tim
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote:
>>>>>>>>>>>>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows
>>>>>>>>>>>>> > and 9 columns (mostly strings):
>>>>>>>>>>>>> > https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Loading the dataset in R using read.csv takes 5 minutes, and
>>>>>>>>>>>>> > the resulting dataframe takes 0.6GB (RStudio takes a total of
>>>>>>>>>>>>> > 1.6GB of memory on my machine):
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > > t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
>>>>>>>>>>>>> >    user  system elapsed
>>>>>>>>>>>>> > 332.295   4.154 343.332
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > > object.size(a)
>>>>>>>>>>>>> > 601496056 bytes  # (0.6 GB)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Loading the same dataset using DataFrames' readtable takes
>>>>>>>>>>>>> > about 30 minutes on the same machine (it varies a bit; the
>>>>>>>>>>>>> > lowest is 25 minutes), and the Julia process (REPL in a
>>>>>>>>>>>>> > terminal) takes 6GB of memory on the same machine.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > (I added a couple of calls to the @time macro inside the
>>>>>>>>>>>>> > readtable function to see what's taking time; the outcomes of
>>>>>>>>>>>>> > these calls are below too.)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > julia> @time DataFrames.readtable("bids.csv");
>>>>>>>>>>>>> > WARNING: Begin readnrows call
>>>>>>>>>>>>> > elapsed time: 29.517358476 seconds (2315258744 bytes
>>>>>>>>>>>>> > allocated, 0.35% gc time)
>>>>>>>>>>>>> > WARNING: End readnrows call
>>>>>>>>>>>>> > WARNING: Begin builddf call
>>>>>>>>>>>>> > elapsed time: 1809.506275842 seconds (18509704816 bytes
>>>>>>>>>>>>> > allocated, 85.54% gc time)
>>>>>>>>>>>>> > WARNING: End builddf call
>>>>>>>>>>>>> > elapsed time: 1840.471467982 seconds (21808681500 bytes
>>>>>>>>>>>>> > allocated, 84.12% gc time)  # total time for loading
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Can you please suggest how I can improve load time and memory
>>>>>>>>>>>>> > usage in DataFrames for sizes this big and bigger?
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Thank you!
