Thank you, it is working now. And it is blazing fast (a factor of 8 on a 1GB file compared to readcsv or readtable; memory allocation is at 17MB). I love it.
Now I only need to modify the data. Is there any fast way to get an Array or a DataFrame of the imported table? (Or would this defeat the purpose?) Ultimately I don't need a DataFrame but a custom data structure I created; I will do the conversion when I find the time. Incidentally, I triggered an InexactError in mmap (126) with a 3GB file. I don't have time to track down the line that caused it right now, though.

Bernhard

On Thursday, October 8, 2015 at 14:05:24 UTC+2, Jacob Quinn wrote:
>
> Pushed some fixes. Thanks for trying it out.
>
> -Jacob
>
> On Wed, Oct 7, 2015 at 11:54 PM, bernhard <[email protected]> wrote:
>
>> Thank you, Quinn.
>>
>> Things do not work (for me), though.
>>
>> Is it possible you are missing a comma after "col" in lines 24 and 33 of Sink.jl?
>> function writefield(io::Sink, val::AbstractString, col N)
>>
>> On Wednesday, October 7, 2015 at 16:36:52 UTC+2, David Gold wrote:
>>>
>>> Yaas. Very excited to see this.
>>>
>>> On Wednesday, October 7, 2015 at 6:07:44 AM UTC-7, Jacob Quinn wrote:
>>>>
>>>> Haha, nice timing. I just pushed a big CSV.jl overhaul for 0.4 yesterday afternoon. I also just pushed the DataStreams.jl package, so you can find that at https://github.com/quinnj/DataStreams.jl, and you'll have to Pkg.clone it. Everything should work at that point.
>>>>
>>>> I'm still cleaning up some other related packages, which is why things aren't documented/registered/tagged quite yet: the interface may evolve slightly, though probably more the low-level machinery. `stream!(::CSV.Source, ::DataStream)` should stay the same.
>>>>
>>>> I've already got a bit of a writeup started for once everything's done, so if you'd rather wait another couple of days or a week, I should have something ready by then.
>>>>
>>>> -Jacob
>>>>
>>>> On Wed, Oct 7, 2015 at 12:33 AM, bernhard <[email protected]> wrote:
>>>>
>>>>> Is there any update on this? Or maybe a timeline/roadmap?
>>>>> I would love to see a faster CSV reader.
>>>>>
>>>>> I tried to take a look at Jacob's CSV.jl.
>>>>> But I seem to be missing https://github.com/lindahua/DataStreams.jl
>>>>> I have no idea where to find the DataStreams package...
>>>>> Does it still exist?
>>>>>
>>>>> Is there any (experimental) way to make CSV.jl work?
>>>>>
>>>>>> On Saturday, June 6, 2015 at 14:41:36 UTC+2, David Gold wrote:
>>>>>>
>>>>>> @Jacob,
>>>>>>
>>>>>> Thank you very much for your explanation! I expect having such a blueprint will make delving into the actual code more tractable for me. I'll be curious to see how your solution here and your proposal for string handling end up playing with the current Julia data ecosystem.
>>>>>>
>>>>>> On Saturday, June 6, 2015 at 1:17:34 AM UTC-4, Jacob Quinn wrote:
>>>>>>>
>>>>>>> @David,
>>>>>>>
>>>>>>> Sorry for the slow response. It's been a busy week :)
>>>>>>>
>>>>>>> Here's a quick rundown of the approach:
>>>>>>>
>>>>>>> - In the still-yet-to-be-officially-published https://github.com/quinnj/CSV.jl package, the bulk of the code goes into creating a `CSV.File` type, where the structure/metadata of the file is parsed/detected/saved in a type (e.g. header, delimiter, newline, number of columns, detected column types, etc.)
>>>>>>> - `SQLite.create` and now `CSV.read` both take a `CSV.File` as input and follow a similar process in parsing:
>>>>>>>   - The actual file contents are mmapped, i.e. the entire file is mapped into memory at once.
>>>>>>>   - There are currently three `readfield` methods (Int, Float64, String) that take an open `CSV.Stream` type (which holds the mmapped data and the current "position" of parsing) and read a single field according to what the type of that column is supposed to be.
>>>>>>>   - For example, `readfield(io::CSV.Stream, ::Type{Float64}, row, col)` will start reading at the current position of the `CSV.Stream` until it hits the next delimiter, newline, or end of the file, and then interpret the contents as a Float64, returning `val, isnull`.
>>>>>>>
>>>>>>> That's pretty much it. One of the most critical performance keys for both SQLite and CSV.read is not copying strings once the file has been mmapped. For SQLite, the sqlite3_bind_text library method actually has a flag to indicate whether the text should be copied or not, so we're able to pass the pointer to the position in the mmapped array directly. For the CSV.read method, which returns a Vector of the columns (as typed arrays), I've rolled a quick-and-dirty CString type that looks like:
>>>>>>>
>>>>>>> immutable CString
>>>>>>>     ptr::Ptr{UInt8}
>>>>>>>     len::Int
>>>>>>> end
>>>>>>>
>>>>>>> With a few extra method definitions, this type looks very close to a real string type, but we can construct it by pointing directly into the mmapped region (which currently isn't possible for native Julia string types). See https://github.com/quinnj/Strings.jl for more brainstorming around this alternative string implementation. You can convert a CString to a Julia string by calling string(x::CString), or map(string, column) for an Array of CSV.CStrings.
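The CString and readfield ideas described above can be sketched roughly as follows. This is an illustration only, not CSV.jl's actual code: the helper names (`field`, the standalone `readfield`) are hypothetical, and the 0.4-era calls (`immutable`, `bytestring`) match the syntax used elsewhere in this thread.

```julia
# Sketch (Julia 0.4-era syntax): a non-copying "string" that points into
# an mmapped byte buffer. Illustrative only, not CSV.jl's real code.
immutable CString
    ptr::Ptr{UInt8}
    len::Int
end

# Converting to a native Julia string copies the bytes.
Base.string(s::CString) = bytestring(s.ptr, s.len)

# Hypothetical helper: wrap the field occupying buf[start:stop]
# without copying any data.
field(buf::Vector{UInt8}, start::Int, stop::Int) =
    CString(pointer(buf) + start - 1, stop - start + 1)

# Hypothetical readfield in the spirit described above: scan from `pos`
# to the next delimiter or newline, interpret the bytes as a Float64,
# and return (val, isnull, newpos).
function readfield(buf::Vector{UInt8}, pos::Int, ::Type{Float64})
    start = pos
    while pos <= length(buf) && buf[pos] != UInt8(',') && buf[pos] != UInt8('\n')
        pos += 1
    end
    s = bytestring(buf[start:pos-1])  # the real code avoids this copy
    isnull = isempty(s)
    val = isnull ? NaN : parse(Float64, s)
    return val, isnull, pos + 1
end
```

The point of the design is that string fields never leave the mmapped buffer until `string()` is explicitly called; only numeric fields require interpretation during the scan.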
>>>>>>>
>>>>>>> As an update on the performance on the Facebook Kaggle competition bids.csv file:
>>>>>>>
>>>>>>> - readcsv: 45 seconds, 33% gc time
>>>>>>> - CSV.read: 19 seconds, 3% gc time
>>>>>>> - SQLite.create: 25 seconds, 3.25% gc time
>>>>>>>
>>>>>>> Anyway, hopefully I'll get around to cleaning up CSV.jl to be released officially, but it's that last 10-20% that's always the hardest to finish up :)
>>>>>>>
>>>>>>> -Jacob
>>>>>>>
>>>>>>> On Mon, Jun 1, 2015 at 4:25 PM, David Gold <[email protected]> wrote:
>>>>>>>
>>>>>>>> @Jacob, I'm just developing a working understanding of these issues. Would you please help me get a better handle on your solution?
>>>>>>>>
>>>>>>>> My understanding thus far: reading a (local) .csv file into a DataFrame using DataFrames.readtable involves reading the file into an IOStream and then parsing that stream into a form amenable to parsing by DataFrames.builddf, which builds the DataFrame object returned by readtable. The work required to get the contents of the .csv file into memory in a form that can be manipulated by Julia functions is work-intensive in this manner. However, with SQLite, the entire file can just be thrown into memory wholesale, along with some metadata (maybe not the right term?) that delineates the tabular properties of the data.
>>>>>>>>
>>>>>>>> What I am curious about, then (if this understanding is not too misguided), is how SQLite returns, say, a column of data that doesn't include a bunch of delimiters. That is, what sort of parsing *does* SQLite do, and when?
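The numbers Jacob quotes above came from timing each reader on the same file; a harness along these lines would reproduce the comparison. The file path and table name are placeholders, and the calls shown are the 2015-era APIs used elsewhere in this thread.

```julia
# Hypothetical benchmark harness for the comparison quoted above.
# Assumes the 0.4-era SQLite.jl/CSV.jl APIs from this thread.
using SQLite, CSV

@time a = readcsv("bids.csv")          # Base reader, for the baseline

f = CSV.File("bids.csv")               # detect header/delimiter/types once
@time cols = CSV.read(f)               # typed columns, strings non-copying

db = SQLite.SQLiteDB()
@time SQLite.create(db, f, "bids")     # load into an in-memory SQLite table
```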
>>>>>>>>
>>>>>>>> On Monday, June 1, 2015 at 1:48:16 PM UTC-4, Jacob Quinn wrote:
>>>>>>>>>
>>>>>>>>> The biggest single advantage SQLite has is the ability to mmap a file and just tell SQLite which pointer addresses start strings and how long they are, all without copying. The huge, huge bottleneck in most implementations is not just identifying where a string starts and how long it is, but then allocating "in program" memory and copying the string into it. With SQLite, we can use an in-memory database, mmap the file, and tell SQLite where each string for a column lives by giving it the starting pointer address and the length. I've been looking into how to solve this problem over the last month or so (apart from Oscar's gc wizardry), and it just occurred to me last week that using SQLite may be the best way; so far, the results are promising!
>>>>>>>>>
>>>>>>>>> -Jacob
>>>>>>>>>
>>>>>>>>> On Mon, Jun 1, 2015 at 11:40 AM, <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Great, thank you Jacob, I will try it out!
>>>>>>>>>>
>>>>>>>>>> Do you have a writeup on the differences between the way you read CSV files and the way it is currently done in Julia? Would love to know more!
>>>>>>>>>>
>>>>>>>>>> Obvious perhaps, but for completeness: reading the data using readcsv or readdlm does not much improve the metrics I reported, suggesting that the overhead from DataFrames is not large.
>>>>>>>>>>
>>>>>>>>>> Thank you again!
>>>>>>>>>>
>>>>>>>>>> On Monday, June 1, 2015 at 1:06:50 PM UTC-4, Jacob Quinn wrote:
>>>>>>>>>>>
>>>>>>>>>>> I've been meaning to clean some things up and properly release the functionality, but I have a new way to read in CSV files that beats anything else out there that I know of.
>>>>>>>>>>> To get the functionality, you'll need to be running 0.4 master, then do:
>>>>>>>>>>>
>>>>>>>>>>> Pkg.add("SQLite")
>>>>>>>>>>> Pkg.checkout("SQLite","jq/updates")
>>>>>>>>>>> Pkg.clone("https://github.com/quinnj/CSV.jl")
>>>>>>>>>>> Pkg.clone("https://github.com/quinnj/Mmap.jl")
>>>>>>>>>>>
>>>>>>>>>>> I then ran the following on the bids.csv file:
>>>>>>>>>>>
>>>>>>>>>>> using SQLite, CSV
>>>>>>>>>>>
>>>>>>>>>>> db = SQLite.SQLiteDB()
>>>>>>>>>>>
>>>>>>>>>>> ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv")
>>>>>>>>>>>
>>>>>>>>>>> @time lines = SQLite.create(db, ff, "temp2")
>>>>>>>>>>>
>>>>>>>>>>> It took 18 seconds on my newish MBP. From the R data.table package, `fread` is the other fastest CSV reader I know of, and it took 34 seconds on my machine. I'm actually pretty surprised by that, since in other tests I've done it was on par with SQLite+CSV or sometimes slightly faster.
>>>>>>>>>>>
>>>>>>>>>>> Now, you're not necessarily getting a Julia structure in this case; it's loading the data into an SQLite table, on which you can then run SQLite.query(db, sql_string) to do manipulations and such.
>>>>>>>>>>>
>>>>>>>>>>> -Jacob
>>>>>>>>>>>
>>>>>>>>>>> On Sun, May 31, 2015 at 9:42 PM, <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you Tim and Jiahao for your responses. Sorry, I did not mention in my OP that I was using Version 0.3.10-pre+1 (2015-05-30 11:26 UTC) Commit 80dd75c* (1 day old release-0.3).
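The SQLite.query call Jacob mentions is the manipulation step; a hypothetical follow-up to his snippet might look like this. The "temp2" table name comes from his example, but the column name is purely illustrative, a guess at the Kaggle dataset's schema.

```julia
# Hypothetical follow-up to the SQLite.create example above: run SQL
# against the loaded table. The column name is illustrative only.
res = SQLite.query(db, "SELECT country, count(*) AS n
                        FROM temp2 GROUP BY country ORDER BY n DESC")
```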
>>>>>>>>>>>>
>>>>>>>>>>>> I tried other releases as Tim suggested:
>>>>>>>>>>>>
>>>>>>>>>>>> On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648* (0 days old master), the same command takes 14 minutes, half what it was taking with release-0.3 but still 3 times more than R's read.csv (5 min). More importantly, the Julia process takes up 8GB of memory (the R session takes 1.6GB). The output of the command `@time DataFrames.readtable("bids.csv");` is:
>>>>>>>>>>>> 857.120 seconds (352 M allocations: 16601 MB, 71.59% gc time) # reduced from 85% to 71%
>>>>>>>>>>>>
>>>>>>>>>>>> For completeness, on Version 0.4.0-dev+4451 (2015-04-22 21:55 UTC) ob/gctune/238ed08* (fork: 1 commits, 39 days), the command `@time DataFrames.readtable("bids.csv");` takes 21 minutes; the output of the macro is:
>>>>>>>>>>>> elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58% gc time in 33 pauses with 31 full sweep)
>>>>>>>>>>>> The process also takes up 8GB of memory on the machine, more than with the earlier version. My machine has also slowed down significantly, so perhaps the increase in memory compared to release-0.3 matters.
>>>>>>>>>>>>
>>>>>>>>>>>> On disabling gc, my machine (a 4GB laptop) goes soul searching, so that's not an option for now.
>>>>>>>>>>>>
>>>>>>>>>>>> Is this the best one can expect for now? I read the discussion on issue #10428 but I did not understand it well :-(
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you!
>>>>>>>>>>>>
>>>>>>>>>>>> On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Not ideal, but for now you can try turning off the garbage collection while reading in the DataFrame.
>>>>>>>>>>>>>
>>>>>>>>>>>>> gc_disable()
>>>>>>>>>>>>> df = DataFrames.readtable("bids.csv")
>>>>>>>>>>>>> gc_enable()
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jiahao Chen
>>>>>>>>>>>>> Research Scientist
>>>>>>>>>>>>> MIT CSAIL
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you're using Julia 0.3, you might want to try current master and/or possibly the "ob/gctune" branch.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/JuliaLang/julia/issues/10428
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> --Tim
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote:
>>>>>>>>>>>>>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows and 9 columns (mostly strings): https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Loading the dataset in R using read.csv takes 5 minutes, and the resulting dataframe takes 0.6GB (RStudio takes a total of 1.6GB memory on my machine):
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > > t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
>>>>>>>>>>>>>> >    user  system elapsed
>>>>>>>>>>>>>> > 332.295   4.154 343.332
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > > object.size(a)
>>>>>>>>>>>>>> > 601496056 bytes # (0.6 GB)
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Loading the same dataset using DataFrames' readtable takes about 30 minutes on the same machine (it varies a bit; the lowest is 25 minutes), and the Julia process (REPL in Terminal) takes 6GB of memory on the same machine.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > (I added a couple of calls to the @time macro inside the readtable function to see what's taking time; the outcomes of these calls are also below.)
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > julia> @time DataFrames.readtable("bids.csv");
>>>>>>>>>>>>>> > WARNING: Begin readnrows call
>>>>>>>>>>>>>> > elapsed time: 29.517358476 seconds (2315258744 bytes allocated, 0.35% gc time)
>>>>>>>>>>>>>> > WARNING: End readnrows call
>>>>>>>>>>>>>> > WARNING: Begin builddf call
>>>>>>>>>>>>>> > elapsed time: 1809.506275842 seconds (18509704816 bytes allocated, 85.54% gc time)
>>>>>>>>>>>>>> > WARNING: End builddf call
>>>>>>>>>>>>>> > elapsed time: 1840.471467982 seconds (21808681500 bytes allocated, 84.12% gc time) # total time for loading
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Can you please suggest how I can improve load time and memory usage in DataFrames for sizes this big and bigger?
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Thank you!
