Re: [julia-users] DataFrames' readtable very slow compared to R's read.csv when loading ~7.6M csv rows

bernhard Tue, 06 Oct 2015 23:34:15 -0700

Is there any update on this? Or maybe a timeline/roadmap?
I would love to see a faster CSV reader.


I tried to take a look at Jacob's CSV.jl.
But I seem to be missing https://github.com/lindahua/DataStreams.jl 
I have no idea where to find DataStreams package....
Does it still exist?

Is there any (experimental) way to make CSV.jl work?



> Am Samstag, 6. Juni 2015 14:41:36 UTC+2 schrieb David Gold:
>
> @Jacob,
>
> Thank you very much for your explanation! I expect having such a blueprint 
> will make delving into the actual code more tractable for me. I'll be 
> curious to see how your solution here and your proposal for string handling 
> end up playing with the current Julia data ecosystem. 
>
> On Saturday, June 6, 2015 at 1:17:34 AM UTC-4, Jacob Quinn wrote:
>>
>> @David,
>>
>> Sorry for the slow response. It's been a busy week :)
>>
>> Here's a quick rundown of the approach:
>>
>> - In the still-yet-to-be-officially-published 
>> https://github.com/quinnj/CSV.jl package, the bulk of the code goes into 
>> creating a `CSV.File` type where the structure/metadata of the file is 
>> parsed/detected/saved in a type (e.g. header, delimiter, newline, # of 
>> columns, detected column types, etc.)
>> - `SQLite.create` and now `CSV.read` both take a `CSV.File` as input and 
>> follow a similar process in parsing:
>>   - The actual file contents are mmapped; i.e. the entire file is loaded 
>> into memory at once
>>   - There are currently three `readfield` methods (Int,Float64,String) 
>> that take an open `CSV.Stream` type (which holds the mmapped data and the 
>> current "position" of parsing), and read a single field according to what 
>> the type of that column is supposed to be
>>       - for example, readfield(io::CSV.Stream, ::Type{Float64}, row, 
>> col), will start reading at the current position of the `CSV.Stream` until 
>> it hits the next delimiter, newline, or end of the file and then interpret 
>> the contents as a Float64, returning `val, isnull`
>>
>> That's pretty much it. One of the most critical performance keys for both 
>> SQLite and CSV.read is non-copying strings once the file has been mmapped. 
>> For SQLite, the sqlite3_bind_text library method actually has a flag to 
>> indicate whether the text should be copied or not, so we're able to pass 
>> the pointer to the position in the mmapped array directly. For the CSV.read 
>> method, which returns a Vector of the columns (as typed arrays), I've 
>> actually rolled a quick and dirty CString type that looks like
>>
>> immutable CString
>>   ptr::Ptr{UInt8}
>>   len::Int
>> end
>>
>> With a few extra method definitions, this type looks very close to a real 
>> string type, but we can construct it by pointing directly to the mmapped 
>> region (which currently isn't possible for native Julia string types). See 
>> https://github.com/quinnj/Strings.jl for more brainstorming around this 
>> alternative string implementation. You can convert a CString to a Julia 
>> string by calling string(x::CString) or map(string,column) for an Array of 
>> CSV.CStrings.
>>
>> As an update on the performance on the Facebook Kaggle competition 
>> bids.csv file:
>>
>> -readcsv: 45 seconds, 33% gc time
>> -CSV.read: 19 seconds, 3% gc time
>> -SQLite.create: 25 seconds, 3.25% gc time
>>
>> Anyway, hopefully I'll get around to cleaning up CSV.jl to be released 
>> officially, but it's that last 10-20% that's always the hardest to finish 
>> up :)
>>
>> -Jacob
>>
>>
>>
>> On Mon, Jun 1, 2015 at 4:25 PM, David Gold <[email protected]> wrote:
>>
>>> @Jacob I'm just developing a working understanding of these issues. 
>>> Would you please help me to get a better handle on your solution?
>>>
>>> My understanding thus far: Reading a (local) .csv file into a DataFrame 
>>> using DataFrames.readtable involves reading the file into an IOStream and 
>>> then parsing that stream into a form amenable to parsing by 
>>> DataFrames.builddf, which builds the DataFrame object returned by 
>>> readtable. The work required to get the contents of the .csv file into 
>>> memory in a form that can be manipulated by Julia functions is 
>>> work-intensive in this manner. However, with SQLite, the entire file can 
>>> just be thrown into memory wholesale, along with some metadata (maybe not 
>>> the right term?) that delineates the tabular properties of the data. 
>>>
>>> What I am curious about, then (if this understanding is not too 
>>> misguided), is how SQLite returns, say, a column of data that doesn't 
>>> include, say, a bunch of delimiters. That is, what sort of parsing *does* 
>>> SQLite do, and when?
>>>
>>> On Monday, June 1, 2015 at 1:48:16 PM UTC-4, Jacob Quinn wrote:
>>>>
>>>> The biggest single advantage SQLite has is the ability to mmap a file 
>>>> and just tell SQLite which pointer addresses start strings and how long 
>>>> they are, all without copying. The huge, huge bottleneck in most 
>>>> implementations, is not just identifying where a string starts and how 
>>>> long 
>>>> it is, but then allocating "in program" memory and copying the string into 
>>>> it. With SQLite, we can use an in-memory database, mmap the file, and tell 
>>>> SQLite where each string for a column lives by giving it the starting 
>>>> pointer address and how long it is. I've been looking into how to solve 
>>>> this problem over the last month or so (apart from Oscar's gc wizardry) 
>>>> and 
>>>> it just occurred to me last week that using SQLite may be the best way; so 
>>>> far, the results are promising!
>>>>
>>>> -Jacob
>>>>
>>>> On Mon, Jun 1, 2015 at 11:40 AM, <[email protected]> wrote:
>>>>
>>>>> Great, thank you Jacob, I will try it out! 
>>>>>
>>>>> Do you have a writeup on differences in the way you read CSV files and 
>>>>> the way it is currently done in Julia? Would love to know more!
>>>>>
>>>>> Obvious perhaps but for completeness: Reading the data using readcsv 
>>>>> or readdlm does not improve much the metrics I reported, suggesting that 
>>>>> the overhead from DataFrames is not much.
>>>>>
>>>>> Thank you again!
>>>>>
>>>>> On Monday, June 1, 2015 at 1:06:50 PM UTC-4, Jacob Quinn wrote:
>>>>>>
>>>>>> I've been meaning to clean some things up and properly release the 
>>>>>> functionality, but I have a new way to read in CSV files that beats 
>>>>>> anything else out there that I know of. To get the functionality, you'll 
>>>>>> need to be running 0.4 master, then do
>>>>>>
>>>>>> Pkg.add("SQLite")
>>>>>> Pkg.checkout("SQLite","jq/updates")
>>>>>> Pkg.clone("https://github.com/quinnj/CSV.jl";)
>>>>>> Pkg.clone("https://github.com/quinnj/Mmap.jl";)
>>>>>>
>>>>>> I then ran the following on the bids.csv file
>>>>>>
>>>>>> using SQLite, CSV
>>>>>>
>>>>>> db = SQLite.SQLiteDB()
>>>>>>
>>>>>> ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv")
>>>>>>
>>>>>> @time lines = SQLite.create(db, ff,"temp2")
>>>>>>
>>>>>> It took 18 seconds on my newish MBP. From the R data.table package, 
>>>>>> the `fread` is the other fastest CSV I know of and it took 34 seconds on 
>>>>>> my 
>>>>>> machine. I'm actually pretty surprised by that, since in other tests 
>>>>>> I've 
>>>>>> done it was on par with the SQLite+CSV or sometimes slightly faster.
>>>>>>
>>>>>> Now, you're not necessarily getting a Julia structure in this case, 
>>>>>> but it's loading the data into an SQLite table, that you can then run 
>>>>>> SQLite.query(db, sql_string) to do manipulations and such.
>>>>>>
>>>>>> -Jacob
>>>>>>
>>>>>>
>>>>>> On Sun, May 31, 2015 at 9:42 PM, <[email protected]> wrote:
>>>>>>
>>>>>>> Thank you Tim and Jiahao for your responses. Sorry, I did not 
>>>>>>> mention in my OP that I was using Version 0.3.10-pre+1 (2015-05-30 
>>>>>>> 11:26 
>>>>>>> UTC) Commit 80dd75c* (1 day old release-0.3).
>>>>>>>
>>>>>>> I tried other releases as Tim suggested:
>>>>>>>
>>>>>>> On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648* (0 
>>>>>>> days old master), 
>>>>>>> the same command takes 14 minutes - half that it was taking with 
>>>>>>> release-0.3 but still 3 times more than that taken by R's read.csv (5 
>>>>>>> min). 
>>>>>>> More important, Julia process takes up 8GB memory (Rsession takes 1.6GB)
>>>>>>> output of the command `@time DataFrames.readtable("bids.csv");` is
>>>>>>> 857.120 seconds      (352 M allocations: 16601 MB, 71.59% gc time) # 
>>>>>>> reduced from 85% to 71%
>>>>>>>
>>>>>>> For completeness, On Version 0.4.0-dev+4451 (2015-04-22 21:55 UTC) 
>>>>>>> ob/gctune/238ed08* (fork: 1 commits, 39 days), the command `@time 
>>>>>>> DataFrames.readtable("bids.csv");` takes 21 minutes; the output of the 
>>>>>>> macro is: 
>>>>>>> elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58% gc 
>>>>>>> time in 33 pauses with 31 full sweep)
>>>>>>> The process also takes up 8GB memory on the machine, more than the 
>>>>>>> earlier one. My machine has also significantly slowed down - so perhaps 
>>>>>>> the 
>>>>>>> increase in memory when compared to release-0.3 is significant.
>>>>>>>
>>>>>>> On disabling gc, my machine (4GB laptop) goes soul searching; so its 
>>>>>>> not an option for now.
>>>>>>>
>>>>>>> Is this the best one can expect for now? I read the discussion on 
>>>>>>> issue #10428 but I did not understand it well :-(
>>>>>>>
>>>>>>> Thank you!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote:
>>>>>>>>
>>>>>>>> Not ideal, but for now you can try turning off the garbage 
>>>>>>>> collection while reading in the DataFrame.
>>>>>>>>
>>>>>>>> gc_disable()
>>>>>>>> df = DataFrames.readtable("bids.csv")
>>>>>>>> gc_enable()
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jiahao Chen
>>>>>>>> Research Scientist
>>>>>>>> MIT CSAIL
>>>>>>>>
>>>>>>>> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> If you're using julia 0.3, you might want to try current master 
>>>>>>>>> and/or
>>>>>>>>> possibly the "ob/gctune" branch.
>>>>>>>>>
>>>>>>>>> https://github.com/JuliaLang/julia/issues/10428
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> --Tim
>>>>>>>>>
>>>>>>>>> On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote:
>>>>>>>>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows 
>>>>>>>>> with 9 columns
>>>>>>>>> > (mostly
>>>>>>>>> > strings). 
>>>>>>>>> https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>>>>>>>>> >
>>>>>>>>> > Loading the dataset in R using read.csv takes 5 minutes and the 
>>>>>>>>> resulting
>>>>>>>>> > dataframe takes 0.6GB (RStudio takes a total of 1.6GB memory on 
>>>>>>>>> my machine)
>>>>>>>>> >
>>>>>>>>> > >t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
>>>>>>>>> >
>>>>>>>>> > user   system elapsed
>>>>>>>>> > 332.295   4.154 343.332
>>>>>>>>> >
>>>>>>>>> > > object.size(a)
>>>>>>>>> >
>>>>>>>>> > 601496056 bytes #(0.6 GB)
>>>>>>>>> >
>>>>>>>>> > Loading the same dataset using DataFrames' readtable takes about 
>>>>>>>>> 30 minutes
>>>>>>>>> > on the same machine (varies a bit, lowest is 25 minutes) and the 
>>>>>>>>> resulting
>>>>>>>>> > (Julia process, REPL on Terminal, takes 6GB memory on the same 
>>>>>>>>> machine)
>>>>>>>>> >
>>>>>>>>> > (I added couple of calls to @time macro inside the readtable 
>>>>>>>>> function to
>>>>>>>>> > see whats taking time - outcomes of these calls too are below)
>>>>>>>>> >
>>>>>>>>> > julia> @time DataFrames.readtable("bids.csv");
>>>>>>>>> > WARNING: Begin readnrows call
>>>>>>>>> > elapsed time: 29.517358476 seconds (2315258744 bytes allocated, 
>>>>>>>>> 0.35% gc
>>>>>>>>> > time)
>>>>>>>>> > WARNING: End readnrows call
>>>>>>>>> > WARNING: Begin builddf call
>>>>>>>>> > elapsed time: 1809.506275842 seconds (18509704816 bytes 
>>>>>>>>> allocated, 85.54%
>>>>>>>>> > gc time)
>>>>>>>>> > WARNING: End builddf call
>>>>>>>>> > elapsed time: 1840.471467982 seconds (21808681500 bytes 
>>>>>>>>> allocated, 84.12%
>>>>>>>>> > gc time) #total time for loading
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > Can you please suggest how I can improve load time and memory 
>>>>>>>>> usage in
>>>>>>>>> > DataFrames for sizes this big and bigger?
>>>>>>>>> >
>>>>>>>>> > Thank you!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>

Re: [julia-users] DataFrames' readtable very slow compared to R's read.csv when loading ~7.6M csv rows

Reply via email to