Thank you, it is working now. And it is blazing fast (a factor of 8 on a 1GB file compared to readcsv or readtable; memory allocation is at 17MB). I love it.
Now I only need to modify the data. Is there any fast way to get an Array or a DataFrame of the imported table? (Or would this defeat the purpose?) Ultimately I don't need a DataFrame but a custom data structure I created; I will do the conversion when I find the time. Incidentally, I triggered an InexactError in mmap (126) with a 3GB file. I don't have time to track down the line that caused it right now, though.

Bernhard

On Thursday, October 8, 2015 at 14:05:24 UTC+2, Jacob Quinn wrote:
>
> Pushed some fixes. Thanks for trying it out.
>
> -Jacob
>
> On Wed, Oct 7, 2015 at 11:54 PM, bernhard <[email protected]> wrote:
>
>> Thank you, Quinn.
>>
>> Things do not work (for me), though.
>>
>> Is it possible you are missing a comma after "col" in lines 24 and 33 of Sink.jl?
>> function writefield(io::Sink, val::AbstractString, col N)
>>
>> On Wednesday, October 7, 2015 at 16:36:52 UTC+2, David Gold wrote:
>>>
>>> Yaas. Very excited to see this.
>>>
>>> On Wednesday, October 7, 2015 at 6:07:44 AM UTC-7, Jacob Quinn wrote:
>>>>
>>>> Haha, nice timing. I just pushed a big CSV.jl overhaul for 0.4 yesterday afternoon. I also just pushed the DataStreams.jl package, so you can find that at https://github.com/quinnj/DataStreams.jl, and you'll have to Pkg.clone it. Everything should work at that point.
>>>>
>>>> I'm still cleaning up some other related packages, which is why things aren't documented/registered/tagged quite yet: the interface may evolve slightly, though probably more the low-level machinery. `stream!(::CSV.Source, ::DataStream)` should stay the same.
>>>>
>>>> I've already got a bit of a writeup started for once everything's done, so if you'd rather wait another couple of days or a week, I should have something ready by then.
>>>>
>>>> -Jacob
>>>>
>>>> On Wed, Oct 7, 2015 at 12:33 AM, bernhard <[email protected]> wrote:
>>>>
>>>>> Is there any update on this? Or maybe a timeline/roadmap?
>>>>> I would love to see a faster CSV reader.
>>>>>
>>>>> I tried to take a look at Jacob's CSV.jl.
>>>>> But I seem to be missing https://github.com/lindahua/DataStreams.jl
>>>>> I have no idea where to find the DataStreams package...
>>>>> Does it still exist?
>>>>>
>>>>> Is there any (experimental) way to make CSV.jl work?
>>>>>
>>>>>> On Saturday, June 6, 2015 at 14:41:36 UTC+2, David Gold wrote:
>>>>>>
>>>>>> @Jacob,
>>>>>>
>>>>>> Thank you very much for your explanation! I expect having such a blueprint will make delving into the actual code more tractable for me. I'll be curious to see how your solution here and your proposal for string handling end up playing with the current Julia data ecosystem.
>>>>>>
>>>>>> On Saturday, June 6, 2015 at 1:17:34 AM UTC-4, Jacob Quinn wrote:
>>>>>>>
>>>>>>> @David,
>>>>>>>
>>>>>>> Sorry for the slow response. It's been a busy week :)
>>>>>>>
>>>>>>> Here's a quick rundown of the approach:
>>>>>>>
>>>>>>> - In the still-yet-to-be-officially-published https://github.com/quinnj/CSV.jl package, the bulk of the code goes into creating a `CSV.File` type, where the structure/metadata of the file is parsed/detected/saved in a type (e.g. header, delimiter, newline, number of columns, detected column types, etc.)
>>>>>>> - `SQLite.create` and now `CSV.read` both take a `CSV.File` as input and follow a similar process in parsing:
>>>>>>>   - The actual file contents are mmapped, i.e. the entire file is mapped into memory at once.
>>>>>>>   - There are currently three `readfield` methods (Int, Float64, String) that take an open `CSV.Stream` type (which holds the mmapped data and the current "position" of parsing) and read a single field according to what the type of that column is supposed to be.
>>>>>>>   - For example, `readfield(io::CSV.Stream, ::Type{Float64}, row, col)` will start reading at the current position of the `CSV.Stream` until it hits the next delimiter, newline, or end of the file, and then interpret the contents as a Float64, returning `val, isnull`.
>>>>>>>
>>>>>>> That's pretty much it. One of the most critical performance keys for both SQLite and CSV.read is not copying strings once the file has been mmapped. For SQLite, the sqlite3_bind_text library method actually has a flag to indicate whether the text should be copied or not, so we're able to pass the pointer to the position in the mmapped array directly. For the CSV.read method, which returns a Vector of the columns (as typed arrays), I've rolled a quick-and-dirty CString type that looks like:
>>>>>>>
>>>>>>> immutable CString
>>>>>>>     ptr::Ptr{UInt8}
>>>>>>>     len::Int
>>>>>>> end
>>>>>>>
>>>>>>> With a few extra method definitions, this type looks very close to a real string type, but we can construct it by pointing directly into the mmapped region (which currently isn't possible for native Julia string types). See https://github.com/quinnj/Strings.jl for more brainstorming around this alternative string implementation. You can convert a CString to a Julia string by calling string(x::CString), or map(string, column) for an Array of CSV.CStrings.
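The CString and readfield ideas described above can be sketched roughly as follows. This is an illustration only, not CSV.jl's actual code: the helper names (`field`, the standalone `readfield`) are hypothetical, and the 0.4-era calls (`immutable`, `bytestring`) match the syntax used elsewhere in this thread.

```julia
# Sketch (Julia 0.4-era syntax): a non-copying "string" that points into
# an mmapped byte buffer. Illustrative only, not CSV.jl's real code.
immutable CString
    ptr::Ptr{UInt8}
    len::Int
end

# Converting to a native Julia string copies the bytes.
Base.string(s::CString) = bytestring(s.ptr, s.len)

# Hypothetical helper: wrap the field occupying buf[start:stop]
# without copying any data.
field(buf::Vector{UInt8}, start::Int, stop::Int) =
    CString(pointer(buf) + start - 1, stop - start + 1)

# Hypothetical readfield in the spirit described above: scan from `pos`
# to the next delimiter or newline, interpret the bytes as a Float64,
# and return (val, isnull, newpos).
function readfield(buf::Vector{UInt8}, pos::Int, ::Type{Float64})
    start = pos
    while pos <= length(buf) && buf[pos] != UInt8(',') && buf[pos] != UInt8('\n')
        pos += 1
    end
    s = bytestring(buf[start:pos-1])  # the real code avoids this copy
    isnull = isempty(s)
    val = isnull ? NaN : parse(Float64, s)
    return val, isnull, pos + 1
end
```

The point of the design is that string fields never leave the mmapped buffer until `string()` is explicitly called; only numeric fields require interpretation during the scan.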
>>>>>>>
>>>>>>> As an update on the performance on the Facebook Kaggle competition bids.csv file:
>>>>>>>
>>>>>>> - readcsv: 45 seconds, 33% gc time
>>>>>>> - CSV.read: 19 seconds, 3% gc time
>>>>>>> - SQLite.create: 25 seconds, 3.25% gc time
>>>>>>>
>>>>>>> Anyway, hopefully I'll get around to cleaning up CSV.jl to be released officially, but it's that last 10-20% that's always the hardest to finish up :)
>>>>>>>
>>>>>>> -Jacob
>>>>>>>
>>>>>>> On Mon, Jun 1, 2015 at 4:25 PM, David Gold <[email protected]> wrote:
>>>>>>>
>>>>>>>> @Jacob, I'm just developing a working understanding of these issues. Would you please help me get a better handle on your solution?
>>>>>>>>
>>>>>>>> My understanding thus far: reading a (local) .csv file into a DataFrame using DataFrames.readtable involves reading the file into an IOStream and then parsing that stream into a form amenable to parsing by DataFrames.builddf, which builds the DataFrame object returned by readtable. The work required to get the contents of the .csv file into memory in a form that can be manipulated by Julia functions is work-intensive in this manner. However, with SQLite, the entire file can just be thrown into memory wholesale, along with some metadata (maybe not the right term?) that delineates the tabular properties of the data.
>>>>>>>>
>>>>>>>> What I am curious about, then (if this understanding is not too misguided), is how SQLite returns, say, a column of data that doesn't include a bunch of delimiters. That is, what sort of parsing *does* SQLite do, and when?
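The numbers Jacob quotes above came from timing each reader on the same file; a harness along these lines would reproduce the comparison. The file path and table name are placeholders, and the calls shown are the 2015-era APIs used elsewhere in this thread.

```julia
# Hypothetical benchmark harness for the comparison quoted above.
# Assumes the 0.4-era SQLite.jl/CSV.jl APIs from this thread.
using SQLite, CSV

@time a = readcsv("bids.csv")          # Base reader, for the baseline

f = CSV.File("bids.csv")               # detect header/delimiter/types once
@time cols = CSV.read(f)               # typed columns, strings non-copying

db = SQLite.SQLiteDB()
@time SQLite.create(db, f, "bids")     # load into an in-memory SQLite table
```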
>>>>>>>>
>>>>>>>> On Monday, June 1, 2015 at 1:48:16 PM UTC-4, Jacob Quinn wrote:
>>>>>>>>>
>>>>>>>>> The biggest single advantage SQLite has is the ability to mmap a file and just tell SQLite which pointer addresses start strings and how long they are, all without copying. The huge, huge bottleneck in most implementations is not just identifying where a string starts and how long it is, but then allocating "in program" memory and copying the string into it. With SQLite, we can use an in-memory database, mmap the file, and tell SQLite where each string for a column lives by giving it the starting pointer address and the length. I've been looking into how to solve this problem over the last month or so (apart from Oscar's gc wizardry), and it just occurred to me last week that using SQLite may be the best way; so far, the results are promising!
>>>>>>>>>
>>>>>>>>> -Jacob
>>>>>>>>>
>>>>>>>>> On Mon, Jun 1, 2015 at 11:40 AM, <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Great, thank you Jacob, I will try it out!
>>>>>>>>>>
>>>>>>>>>> Do you have a writeup on the differences between the way you read CSV files and the way it is currently done in Julia? Would love to know more!
>>>>>>>>>>
>>>>>>>>>> Obvious perhaps, but for completeness: reading the data using readcsv or readdlm does not much improve the metrics I reported, suggesting that the overhead from DataFrames is not large.
>>>>>>>>>>
>>>>>>>>>> Thank you again!
>>>>>>>>>>
>>>>>>>>>> On Monday, June 1, 2015 at 1:06:50 PM UTC-4, Jacob Quinn wrote:
>>>>>>>>>>>
>>>>>>>>>>> I've been meaning to clean some things up and properly release the functionality, but I have a new way to read in CSV files that beats anything else out there that I know of.
>>>>>>>>>>> To get the functionality, you'll need to be running 0.4 master, then do:
>>>>>>>>>>>
>>>>>>>>>>> Pkg.add("SQLite")
>>>>>>>>>>> Pkg.checkout("SQLite","jq/updates")
>>>>>>>>>>> Pkg.clone("https://github.com/quinnj/CSV.jl")
>>>>>>>>>>> Pkg.clone("https://github.com/quinnj/Mmap.jl")
>>>>>>>>>>>
>>>>>>>>>>> I then ran the following on the bids.csv file:
>>>>>>>>>>>
>>>>>>>>>>> using SQLite, CSV
>>>>>>>>>>>
>>>>>>>>>>> db = SQLite.SQLiteDB()
>>>>>>>>>>>
>>>>>>>>>>> ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv")
>>>>>>>>>>>
>>>>>>>>>>> @time lines = SQLite.create(db, ff, "temp2")
>>>>>>>>>>>
>>>>>>>>>>> It took 18 seconds on my newish MBP. From the R data.table package, `fread` is the other fastest CSV reader I know of, and it took 34 seconds on my machine. I'm actually pretty surprised by that, since in other tests I've done it was on par with SQLite+CSV or sometimes slightly faster.
>>>>>>>>>>>
>>>>>>>>>>> Now, you're not necessarily getting a Julia structure in this case; it's loading the data into an SQLite table, on which you can then run SQLite.query(db, sql_string) to do manipulations and such.
>>>>>>>>>>>
>>>>>>>>>>> -Jacob
>>>>>>>>>>>
>>>>>>>>>>> On Sun, May 31, 2015 at 9:42 PM, <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you Tim and Jiahao for your responses. Sorry, I did not mention in my OP that I was using Version 0.3.10-pre+1 (2015-05-30 11:26 UTC) Commit 80dd75c* (1 day old release-0.3).
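The SQLite.query call Jacob mentions is the manipulation step; a hypothetical follow-up to his snippet might look like this. The "temp2" table name comes from his example, but the column name is purely illustrative, a guess at the Kaggle dataset's schema.

```julia
# Hypothetical follow-up to the SQLite.create example above: run SQL
# against the loaded table. The column name is illustrative only.
res = SQLite.query(db, "SELECT country, count(*) AS n
                        FROM temp2 GROUP BY country ORDER BY n DESC")
```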
>>>>>>>>>>>>
>>>>>>>>>>>> I tried other releases as Tim suggested:
>>>>>>>>>>>>
>>>>>>>>>>>> On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648* (0 days old master), the same command takes 14 minutes, half what it was taking with release-0.3 but still 3 times more than R's read.csv (5 min). More importantly, the Julia process takes up 8GB of memory (the R session takes 1.6GB). The output of the command `@time DataFrames.readtable("bids.csv");` is:
>>>>>>>>>>>> 857.120 seconds (352 M allocations: 16601 MB, 71.59% gc time) # reduced from 85% to 71%
>>>>>>>>>>>>
>>>>>>>>>>>> For completeness, on Version 0.4.0-dev+4451 (2015-04-22 21:55 UTC) ob/gctune/238ed08* (fork: 1 commits, 39 days), the command `@time DataFrames.readtable("bids.csv");` takes 21 minutes; the output of the macro is:
>>>>>>>>>>>> elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58% gc time in 33 pauses with 31 full sweep)
>>>>>>>>>>>> The process also takes up 8GB of memory on the machine, more than with the earlier version. My machine has also slowed down significantly, so perhaps the increase in memory compared to release-0.3 matters.
>>>>>>>>>>>>
>>>>>>>>>>>> On disabling gc, my machine (a 4GB laptop) goes soul searching, so that's not an option for now.
>>>>>>>>>>>>
>>>>>>>>>>>> Is this the best one can expect for now? I read the discussion on issue #10428 but I did not understand it well :-(
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you!
>>>>>>>>>>>>
>>>>>>>>>>>> On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Not ideal, but for now you can try turning off the garbage collection while reading in the DataFrame.
>>>>>>>>>>>>>
>>>>>>>>>>>>> gc_disable()
>>>>>>>>>>>>> df = DataFrames.readtable("bids.csv")
>>>>>>>>>>>>> gc_enable()
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jiahao Chen
>>>>>>>>>>>>> Research Scientist
>>>>>>>>>>>>> MIT CSAIL
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If you're using Julia 0.3, you might want to try current master and/or possibly the "ob/gctune" branch.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/JuliaLang/julia/issues/10428
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> --Tim
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote:
>>>>>>>>>>>>>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows and 9 columns (mostly strings): https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Loading the dataset in R using read.csv takes 5 minutes, and the resulting dataframe takes 0.6GB (RStudio takes a total of 1.6GB memory on my machine):
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > > t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
>>>>>>>>>>>>>> >    user  system elapsed
>>>>>>>>>>>>>> > 332.295   4.154 343.332
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > > object.size(a)
>>>>>>>>>>>>>> > 601496056 bytes # (0.6 GB)
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Loading the same dataset using DataFrames' readtable takes about 30 minutes on the same machine (it varies a bit; the lowest is 25 minutes), and the Julia process (REPL in Terminal) takes 6GB of memory on the same machine.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > (I added a couple of calls to the @time macro inside the readtable function to see what's taking time; the outcomes of these calls are also below.)
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > julia> @time DataFrames.readtable("bids.csv");
>>>>>>>>>>>>>> > WARNING: Begin readnrows call
>>>>>>>>>>>>>> > elapsed time: 29.517358476 seconds (2315258744 bytes allocated, 0.35% gc time)
>>>>>>>>>>>>>> > WARNING: End readnrows call
>>>>>>>>>>>>>> > WARNING: Begin builddf call
>>>>>>>>>>>>>> > elapsed time: 1809.506275842 seconds (18509704816 bytes allocated, 85.54% gc time)
>>>>>>>>>>>>>> > WARNING: End builddf call
>>>>>>>>>>>>>> > elapsed time: 1840.471467982 seconds (21808681500 bytes allocated, 84.12% gc time) # total time for loading
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Can you please suggest how I can improve load time and memory usage in DataFrames for sizes this big and bigger?
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Thank you!
