So you could convert to a DataFrame without copying by doing:
DataFrame(convert(Vector{Any}, ds.data),
          Symbol[symbol(x) for x in ds.schema.header])
where `ds` is your `DataStream` that you read your CSV file into. The
result ends up looking a little awkward, mainly because I believe
DataFrames is calling `show` on individual elements instead of tapping into
the NullableArrays `show` code, but it's not a huge deal.
The other way is you can just work directly with `ds.data` which is a
`Vector{NullableVector{T}}` by default. I'll eventually provide better
functionality around specifying your own `DataStream` type to be whatever
you want (even a DataFrame if you want), but for now, we have fairly
sensible defaults.
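For instance, to materialize a single column you could do something like
this (a rough sketch, assuming the first column is an Int column; untested):
col = ds.data[1]                # a NullableVector{Int}
vals = [isnull(col[i]) ? 0 : get(col[i]) for i = 1:length(col)]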
-Jacob
p.s. on the mmap issue, are you running on a 32-bit machine? I believe it's
known that 32-bit can have troubles with large mmaps, but I'm not sure what
the collective wisdom is on dealing with that. If you're on 64-bit and you
could share your system details + sample file, that'd be really helpful
in tracking down what's going on.
On Thu, Oct 8, 2015 at 6:54 AM, bernhard <[email protected]> wrote:
> Thank you, it is working now.
> And it is blazing fast (a factor of 8 on a 1GB file compared to readcsv or
> readtable; memory allocation is at 17MB). I love it.
>
> Now I only need to modify the data. Is there any fast way to get an Array
> or a DataFrame of the imported table? (or would this defeat the purpose?)
> In the end I don't need a DataFrame but a custom data structure I created. I will
> do the conversion when I find time to do so.
>
> Coincidentally I triggered an InexactError in mmap (126) with a 3GB file. I
> don't have time to find the line which caused it right now though.
>
> Bernhard
>
> Am Donnerstag, 8. Oktober 2015 14:05:24 UTC+2 schrieb Jacob Quinn:
>>
>> Pushed some fixes. Thanks for trying it out.
>>
>> -Jacob
>>
>> On Wed, Oct 7, 2015 at 11:54 PM, bernhard <[email protected]> wrote:
>>
>>> Thank you Quinn
>>>
>>> Things do not work (for me) though.
>>>
>>> Is it possible you are missing a comma after "col" in lines 24 and 33 of
>>> Sink.jl?
>>> function writefield(io::Sink, val::AbstractString, col N)
>>>
>>>
>>>
>>> Am Mittwoch, 7. Oktober 2015 16:36:52 UTC+2 schrieb David Gold:
>>>>
>>>> Yaas. Very excited to see this.
>>>>
>>>> On Wednesday, October 7, 2015 at 6:07:44 AM UTC-7, Jacob Quinn wrote:
>>>>>
>>>>> Haha, nice timing. I pushed a big CSV.jl overhaul for 0.4 yesterday
>>>>> afternoon, and I've just pushed the DataStreams.jl package; you can
>>>>> find that at https://github.com/quinnj/DataStreams.jl, and you'll
>>>>> have to Pkg.clone it. Everything should work at that point.
>>>>>
>>>>> I'm still cleaning up some other related packages, so that's why
>>>>> things aren't documented/registered/tagged quite yet; the interface may
>>>>> still evolve slightly, probably more in the low-level machinery.
>>>>> `stream!(::CSV.Source, ::DataStream)`, though, should stay the same.
>>>>>
>>>>> I've already got a bit of a writeup started for once everything's done,
>>>>> so if you'd rather wait another couple of days or a week, I should have
>>>>> something ready by then.
>>>>>
>>>>> -Jacob
>>>>>
>>>>> On Wed, Oct 7, 2015 at 12:33 AM, bernhard <[email protected]> wrote:
>>>>>
>>>>>> Is there any update on this? Or maybe a timeline/roadmap?
>>>>>> I would love to see a faster CSV reader.
>>>>>>
>>>>>> I tried to take a look at Jacob's CSV.jl.
>>>>>> But I seem to be missing https://github.com/lindahua/DataStreams.jl
>>>>>> I have no idea where to find the DataStreams package....
>>>>>> Does it still exist?
>>>>>>
>>>>>> Is there any (experimental) way to make CSV.jl work?
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Am Samstag, 6. Juni 2015 14:41:36 UTC+2 schrieb David Gold:
>>>>>>>
>>>>>>> @Jacob,
>>>>>>>
>>>>>>> Thank you very much for your explanation! I expect having such a
>>>>>>> blueprint will make delving into the actual code more tractable for me.
>>>>>>> I'll be curious to see how your solution here and your proposal for
>>>>>>> string
>>>>>>> handling end up playing with the current Julia data ecosystem.
>>>>>>>
>>>>>>> On Saturday, June 6, 2015 at 1:17:34 AM UTC-4, Jacob Quinn wrote:
>>>>>>>>
>>>>>>>> @David,
>>>>>>>>
>>>>>>>> Sorry for the slow response. It's been a busy week :)
>>>>>>>>
>>>>>>>> Here's a quick rundown of the approach:
>>>>>>>>
>>>>>>>> - In the still-yet-to-be-officially-published
>>>>>>>> https://github.com/quinnj/CSV.jl package, the bulk of the code
>>>>>>>> goes into creating a `CSV.File` type, where the structure/metadata
>>>>>>>> of the file is parsed/detected/saved in a type (e.g. header,
>>>>>>>> delimiter, newline, # of columns, detected column types, etc.)
>>>>>>>> - `SQLite.create` and now `CSV.read` both take a `CSV.File` as
>>>>>>>> input and follow a similar process in parsing:
>>>>>>>> - The actual file contents are mmapped; i.e. the entire file is
>>>>>>>> loaded into memory at once
>>>>>>>> - There are currently three `readfield` methods
>>>>>>>> (Int, Float64, String) that take an open `CSV.Stream` type (which
>>>>>>>> holds the mmapped data and the current "position" of parsing), and
>>>>>>>> read a single field according to what the type of that column is
>>>>>>>> supposed to be
>>>>>>>> - for example, readfield(io::CSV.Stream, ::Type{Float64},
>>>>>>>> row, col) will start reading at the current position of the
>>>>>>>> `CSV.Stream` until it hits the next delimiter, newline, or end of
>>>>>>>> the file, and then interpret the contents as a Float64, returning
>>>>>>>> `val, isnull` (see the sketch below)
>>>>>>>>
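>>>>>>>> As a rough sketch of the shape of such a method (hypothetical
>>>>>>>> field names and constants; the real CSV.jl internals differ):
>>>>>>>>
>>>>>>>> const DELIM, NEWLINE = UInt8(','), UInt8('\n')
>>>>>>>> function readfield(io, ::Type{Float64}, row, col)
>>>>>>>>     start = io.pos                   # current byte position
>>>>>>>>     while io.pos <= length(io.data) &&
>>>>>>>>           io.data[io.pos] != DELIM && io.data[io.pos] != NEWLINE
>>>>>>>>         io.pos += 1                  # scan to the end of the field
>>>>>>>>     end
>>>>>>>>     len = io.pos - start
>>>>>>>>     io.pos += 1                      # step past the delimiter
>>>>>>>>     len == 0 && return NaN, true     # empty field => null
>>>>>>>>     return parse(Float64, bytestring(io.data[start:start+len-1])), false
>>>>>>>> end
>>>>>>>>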
>>>>>>>> That's pretty much it. One of the most critical performance keys
>>>>>>>> for both SQLite and CSV.read is non-copying strings once the file
>>>>>>>> has been mmapped. For SQLite, the sqlite3_bind_text library method
>>>>>>>> actually has a flag to indicate whether the text should be copied
>>>>>>>> or not, so we're able to pass the pointer to the position in the
>>>>>>>> mmapped array directly. For the CSV.read method, which returns a
>>>>>>>> Vector of the columns (as typed arrays), I've actually rolled a
>>>>>>>> quick and dirty CString type that looks like
>>>>>>>>
>>>>>>>> immutable CString
>>>>>>>> ptr::Ptr{UInt8}
>>>>>>>> len::Int
>>>>>>>> end
>>>>>>>>
>>>>>>>> With a few extra method definitions, this type looks very close to
>>>>>>>> a real string type, but we can construct it by pointing directly to the
>>>>>>>> mmapped region (which currently isn't possible for native Julia string
>>>>>>>> types). See https://github.com/quinnj/Strings.jl for more
>>>>>>>> brainstorming around this alternative string implementation. You can
>>>>>>>> convert a CString to a Julia string by calling string(x::CString) or
>>>>>>>> map(string,column) for an Array of CSV.CStrings.
>>>>>>>>
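>>>>>>>> For reference, the "few extra method definitions" are along these
>>>>>>>> lines - a sketch, not the exact CSV.jl code:
>>>>>>>>
>>>>>>>> Base.string(s::CString) = bytestring(s.ptr, s.len)  # the only copy
>>>>>>>> Base.sizeof(s::CString) = s.len
>>>>>>>> Base.show(io::IO, s::CString) = print(io, '"', string(s), '"')
>>>>>>>>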
>>>>>>>> As an update on the performance on the Facebook Kaggle competition
>>>>>>>> bids.csv file:
>>>>>>>>
>>>>>>>> - readcsv: 45 seconds, 33% gc time
>>>>>>>> - CSV.read: 19 seconds, 3% gc time
>>>>>>>> - SQLite.create: 25 seconds, 3.25% gc time
>>>>>>>>
>>>>>>>> Anyway, hopefully I'll get around to cleaning up CSV.jl to be
>>>>>>>> released officially, but it's that last 10-20% that's always the
>>>>>>>> hardest to finish up :)
>>>>>>>>
>>>>>>>> -Jacob
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jun 1, 2015 at 4:25 PM, David Gold <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> @Jacob I'm just developing a working understanding of these
>>>>>>>>> issues. Would you please help me to get a better handle on your
>>>>>>>>> solution?
>>>>>>>>>
>>>>>>>>> My understanding thus far: Reading a (local) .csv file into a
>>>>>>>>> DataFrame using DataFrames.readtable involves reading the file
>>>>>>>>> into an IOStream and then parsing that stream into a form amenable
>>>>>>>>> to processing by DataFrames.builddf, which builds the DataFrame
>>>>>>>>> object returned by readtable. Getting the contents of the .csv
>>>>>>>>> file into memory in a form that Julia functions can manipulate is
>>>>>>>>> work-intensive in this approach. However, with SQLite, the entire
>>>>>>>>> file can just be thrown into memory wholesale, along with some
>>>>>>>>> metadata (maybe not the right term?) that delineates the tabular
>>>>>>>>> properties of the data.
>>>>>>>>>
>>>>>>>>> What I am curious about, then (if this understanding is not too
>>>>>>>>> misguided), is how SQLite returns, say, a column of data that
>>>>>>>>> doesn't include a bunch of delimiters. That is, what sort of
>>>>>>>>> parsing *does* SQLite do, and when?
>>>>>>>>>
>>>>>>>>> On Monday, June 1, 2015 at 1:48:16 PM UTC-4, Jacob Quinn wrote:
>>>>>>>>>>
>>>>>>>>>> The biggest single advantage SQLite has is the ability to mmap a
>>>>>>>>>> file and just tell SQLite which pointer addresses start strings
>>>>>>>>>> and how long they are, all without copying. The huge, huge
>>>>>>>>>> bottleneck in most implementations is not just identifying where
>>>>>>>>>> a string starts and how long it is, but then allocating
>>>>>>>>>> "in program" memory and copying the string into it. With SQLite,
>>>>>>>>>> we can use an in-memory database, mmap the file, and tell SQLite
>>>>>>>>>> where each string for a column lives by giving it the starting
>>>>>>>>>> pointer address and how long it is. I've been looking into how to
>>>>>>>>>> solve this problem over the last month or so (apart from Oscar's
>>>>>>>>>> gc wizardry) and it just occurred to me last week that using
>>>>>>>>>> SQLite may be the best way; so far, the results are promising!
>>>>>>>>>>
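>>>>>>>>>> Schematically, the no-copy bind boils down to a ccall like this
>>>>>>>>>> (a sketch, with stmt, col, mmapped, offset, and len assumed to be
>>>>>>>>>> in scope; the actual SQLite.jl wrapper differs):
>>>>>>>>>>
>>>>>>>>>> # SQLITE_STATIC (a NULL destructor) tells SQLite the bytes stay
>>>>>>>>>> # valid, so it binds the pointer directly instead of copying
>>>>>>>>>> const SQLITE_STATIC = C_NULL
>>>>>>>>>> ccall((:sqlite3_bind_text, "libsqlite3"), Cint,
>>>>>>>>>>       (Ptr{Void}, Cint, Ptr{UInt8}, Cint, Ptr{Void}),
>>>>>>>>>>       stmt, col, pointer(mmapped) + offset, len, SQLITE_STATIC)
>>>>>>>>>>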
>>>>>>>>>> -Jacob
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 1, 2015 at 11:40 AM, <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Great, thank you Jacob, I will try it out!
>>>>>>>>>>>
>>>>>>>>>>> Do you have a writeup on the differences between the way you read
>>>>>>>>>>> CSV files and the way it is currently done in Julia? Would love to
>>>>>>>>>>> know more!
>>>>>>>>>>>
>>>>>>>>>>> Obvious perhaps but for completeness: Reading the data using
>>>>>>>>>>> readcsv or readdlm does not much improve the metrics I reported,
>>>>>>>>>>> suggesting that the overhead from DataFrames is small.
>>>>>>>>>>>
>>>>>>>>>>> Thank you again!
>>>>>>>>>>>
>>>>>>>>>>> On Monday, June 1, 2015 at 1:06:50 PM UTC-4, Jacob Quinn wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I've been meaning to clean some things up and properly release
>>>>>>>>>>>> the functionality, but I have a new way to read in CSV files that
>>>>>>>>>>>> beats
>>>>>>>>>>>> anything else out there that I know of. To get the functionality,
>>>>>>>>>>>> you'll
>>>>>>>>>>>> need to be running 0.4 master, then do
>>>>>>>>>>>>
>>>>>>>>>>>> Pkg.add("SQLite")
>>>>>>>>>>>> Pkg.checkout("SQLite","jq/updates")
>>>>>>>>>>>> Pkg.clone("https://github.com/quinnj/CSV.jl")
>>>>>>>>>>>> Pkg.clone("https://github.com/quinnj/Mmap.jl")
>>>>>>>>>>>>
>>>>>>>>>>>> I then ran the following on the bids.csv file
>>>>>>>>>>>>
>>>>>>>>>>>> using SQLite, CSV
>>>>>>>>>>>>
>>>>>>>>>>>> db = SQLite.SQLiteDB()
>>>>>>>>>>>>
>>>>>>>>>>>> ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv")
>>>>>>>>>>>>
>>>>>>>>>>>> @time lines = SQLite.create(db, ff,"temp2")
>>>>>>>>>>>>
>>>>>>>>>>>> It took 18 seconds on my newish MBP. `fread` from the R
>>>>>>>>>>>> data.table package is the fastest other CSV reader I know of,
>>>>>>>>>>>> and it took 34 seconds on my machine. I'm actually pretty
>>>>>>>>>>>> surprised by that,
>>>>>>>>>>>> since in
>>>>>>>>>>>> other tests I've done it was on par with the SQLite+CSV or
>>>>>>>>>>>> sometimes
>>>>>>>>>>>> slightly faster.
>>>>>>>>>>>>
>>>>>>>>>>>> Now, you're not necessarily getting a Julia structure in this
>>>>>>>>>>>> case, but it's loading the data into an SQLite table, on which
>>>>>>>>>>>> you can then run
>>>>>>>>>>>> SQLite.query(db, sql_string) to do manipulations and such.
>>>>>>>>>>>>
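>>>>>>>>>>>> For example, something like (using the table name from above):
>>>>>>>>>>>>
>>>>>>>>>>>> res = SQLite.query(db, "SELECT count(*) FROM temp2")
>>>>>>>>>>>>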
>>>>>>>>>>>> -Jacob
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sun, May 31, 2015 at 9:42 PM, <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you Tim and Jiahao for your responses. Sorry, I did not
>>>>>>>>>>>>> mention in my OP that I was using Version 0.3.10-pre+1
>>>>>>>>>>>>> (2015-05-30 11:26 UTC) Commit 80dd75c* (1 day old release-0.3).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried other releases as Tim suggested:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit
>>>>>>>>>>>>> bfa8648* (0 days old master),
>>>>>>>>>>>>> the same command takes 14 minutes - half what it was taking with
>>>>>>>>>>>>> release-0.3, but still 3 times longer than R's read.csv (5 min).
>>>>>>>>>>>>> More importantly, the Julia process takes up 8GB of memory (the
>>>>>>>>>>>>> R session takes 1.6GB).
>>>>>>>>>>>>> The output of the command `@time DataFrames.readtable("bids.csv");` is
>>>>>>>>>>>>> 857.120 seconds (352 M allocations: 16601 MB, 71.59% gc time)
>>>>>>>>>>>>> # reduced from 85% to 71%
>>>>>>>>>>>>>
>>>>>>>>>>>>> For completeness, on Version 0.4.0-dev+4451 (2015-04-22 21:55
>>>>>>>>>>>>> UTC) ob/gctune/238ed08* (fork: 1 commits, 39 days), the command
>>>>>>>>>>>>> `@time DataFrames.readtable("bids.csv");` takes 21 minutes; the
>>>>>>>>>>>>> output of the macro is:
>>>>>>>>>>>>> elapsed time: 1303.167204109 seconds (18703 MB allocated,
>>>>>>>>>>>>> 76.58% gc time in 33 pauses with 31 full sweep)
>>>>>>>>>>>>> The process also takes up 8GB of memory on the machine, more
>>>>>>>>>>>>> than with the earlier build. My machine has also slowed down
>>>>>>>>>>>>> significantly - so perhaps the increase in memory compared to
>>>>>>>>>>>>> release-0.3 is significant.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On disabling gc, my machine (4GB laptop) goes soul-searching, so
>>>>>>>>>>>>> it's not an option for now.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is this the best one can expect for now? I read the discussion
>>>>>>>>>>>>> on issue #10428 but I did not understand it well :-(
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you!
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Not ideal, but for now you can try turning off the garbage
>>>>>>>>>>>>>> collection while reading in the DataFrame.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> gc_disable()
>>>>>>>>>>>>>> df = DataFrames.readtable("bids.csv")
>>>>>>>>>>>>>> gc_enable()
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jiahao Chen
>>>>>>>>>>>>>> Research Scientist
>>>>>>>>>>>>>> MIT CSAIL
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <[email protected]>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If you're using julia 0.3, you might want to try current
>>>>>>>>>>>>>>> master and/or
>>>>>>>>>>>>>>> possibly the "ob/gctune" branch.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://github.com/JuliaLang/julia/issues/10428
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> --Tim
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sunday, May 31, 2015 09:50:03 AM [email protected]
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows
>>>>>>>>>>>>>>> > with 9 columns (mostly strings):
>>>>>>>>>>>>>>> > https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Loading the dataset in R using read.csv takes 5 minutes and
>>>>>>>>>>>>>>> > the resulting dataframe takes 0.6GB (RStudio takes a total of
>>>>>>>>>>>>>>> > 1.6GB memory on my machine)
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > >t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > user system elapsed
>>>>>>>>>>>>>>> > 332.295 4.154 343.332
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > > object.size(a)
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > 601496056 bytes #(0.6 GB)
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Loading the same dataset using DataFrames' readtable takes
>>>>>>>>>>>>>>> > about 30 minutes on the same machine (varies a bit, lowest is
>>>>>>>>>>>>>>> > 25 minutes) and the resulting
>>>>>>>>>>>>>>> > (Julia process, REPL on Terminal, takes 6GB memory on the
>>>>>>>>>>>>>>> > same machine)
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > (I added a couple of calls to the @time macro inside the
>>>>>>>>>>>>>>> > readtable function to see what's taking time - the outcomes
>>>>>>>>>>>>>>> > of these calls are also below)
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > julia> @time DataFrames.readtable("bids.csv");
>>>>>>>>>>>>>>> > WARNING: Begin readnrows call
>>>>>>>>>>>>>>> > elapsed time: 29.517358476 seconds (2315258744 bytes
>>>>>>>>>>>>>>> allocated, 0.35% gc
>>>>>>>>>>>>>>> > time)
>>>>>>>>>>>>>>> > WARNING: End readnrows call
>>>>>>>>>>>>>>> > WARNING: Begin builddf call
>>>>>>>>>>>>>>> > elapsed time: 1809.506275842 seconds (18509704816 bytes
>>>>>>>>>>>>>>> allocated, 85.54%
>>>>>>>>>>>>>>> > gc time)
>>>>>>>>>>>>>>> > WARNING: End builddf call
>>>>>>>>>>>>>>> > elapsed time: 1840.471467982 seconds (21808681500 bytes
>>>>>>>>>>>>>>> allocated, 84.12%
>>>>>>>>>>>>>>> > gc time) #total time for loading
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Can you please suggest how I can improve load time and
>>>>>>>>>>>>>>> memory usage in
>>>>>>>>>>>>>>> > DataFrames for sizes this big and bigger?
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Thank you!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>
>>