Pushed some fixes. Thanks for trying it out. -Jacob
On Wed, Oct 7, 2015 at 11:54 PM, bernhard <[email protected]> wrote:
> Thank you, Quinn.
>
> Things do not work (for me) though.
>
> Is it possible you are missing a comma after "col" in lines 24 and 33 of
> Sink.jl?
> function writefield(io::Sink, val::AbstractString, col N)
>
> On Wednesday, October 7, 2015 at 16:36:52 UTC+2, David Gold wrote:
>>
>> Yaas. Very excited to see this.
>>
>> On Wednesday, October 7, 2015 at 6:07:44 AM UTC-7, Jacob Quinn wrote:
>>>
>>> Haha, nice timing. I just pushed a big CSV.jl overhaul for 0.4 yesterday
>>> afternoon. I just pushed the DataStreams.jl package, so you can find that
>>> at https://github.com/quinnj/DataStreams.jl, and you'll have to
>>> Pkg.clone it. Everything should work at that point.
>>>
>>> I'm still cleaning up some other related packages, which is why things
>>> aren't documented/registered/tagged quite yet; the interface may still
>>> evolve slightly, but probably more in the low-level machinery, so
>>> `stream!(::CSV.Source, ::DataStream)` should stay the same.
>>>
>>> I've already got a bit of a writeup started, so if you'd rather wait
>>> another couple of days or a week, I should have something ready by then.
>>>
>>> -Jacob
>>>
>>> On Wed, Oct 7, 2015 at 12:33 AM, bernhard <[email protected]> wrote:
>>>
>>>> Is there any update on this? Or maybe a timeline/roadmap?
>>>> I would love to see a faster CSV reader.
>>>>
>>>> I tried to take a look at Jacob's CSV.jl,
>>>> but I seem to be missing https://github.com/lindahua/DataStreams.jl
>>>> and I have no idea where to find the DataStreams package...
>>>> Does it still exist?
>>>>
>>>> Is there any (experimental) way to make CSV.jl work?
>>>>
>>>>> On Saturday, June 6, 2015 at 14:41:36 UTC+2, David Gold wrote:
>>>>>
>>>>> @Jacob,
>>>>>
>>>>> Thank you very much for your explanation! I expect having such a
>>>>> blueprint will make delving into the actual code more tractable for me.
>>>>> I'll be curious to see how your solution here and your proposal for
>>>>> string handling end up playing with the current Julia data ecosystem.
>>>>>
>>>>> On Saturday, June 6, 2015 at 1:17:34 AM UTC-4, Jacob Quinn wrote:
>>>>>>
>>>>>> @David,
>>>>>>
>>>>>> Sorry for the slow response. It's been a busy week :)
>>>>>>
>>>>>> Here's a quick rundown of the approach:
>>>>>>
>>>>>> - In the still-yet-to-be-officially-published
>>>>>> https://github.com/quinnj/CSV.jl package, the bulk of the code goes
>>>>>> into creating a `CSV.File` type in which the structure/metadata of the
>>>>>> file is parsed/detected/saved (e.g. header, delimiter, newline, # of
>>>>>> columns, detected column types, etc.)
>>>>>> - `SQLite.create` and now `CSV.read` both take a `CSV.File` as input
>>>>>> and follow a similar parsing process:
>>>>>>   - The actual file contents are mmapped, i.e. the entire file is
>>>>>> loaded into memory at once
>>>>>>   - There are currently three `readfield` methods
>>>>>> (Int, Float64, String) that take an open `CSV.Stream` type (which holds
>>>>>> the mmapped data and the current "position" of parsing) and read a
>>>>>> single field according to what the type of that column is supposed to be
>>>>>>   - For example, readfield(io::CSV.Stream, ::Type{Float64}, row, col)
>>>>>> will start reading at the current position of the `CSV.Stream` until it
>>>>>> hits the next delimiter, newline, or end of the file, and then interpret
>>>>>> the contents as a Float64, returning `val, isnull` (see the sketch
>>>>>> below)
>>>>>>
>>>>>> That's pretty much it.
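To make the rundown above concrete, here is a minimal sketch of what such a
readfield method could look like, in the Julia 0.4 syntax of the thread. The
FieldStream type, the hard-coded comma delimiter, and the simplified null
handling are illustrative assumptions, not the actual CSV.jl internals (which
also handle quoting, configurable delimiters, and more):

    # Illustrative stand-in for CSV.Stream: the mmapped bytes plus a
    # current parse position.
    type FieldStream
        data::Vector{UInt8}   # mmapped file contents
        pos::Int              # current parse position
    end

    # Read one Float64 field: scan to the next delimiter/newline/EOF,
    # then interpret the bytes in between. Returns `val, isnull`.
    function readfield(io::FieldStream, ::Type{Float64})
        start = io.pos
        while io.pos <= length(io.data) &&
              io.data[io.pos] != UInt8(',') && io.data[io.pos] != UInt8('\n')
            io.pos += 1
        end
        s = bytestring(io.data[start:io.pos-1])
        io.pos += 1                   # step past the delimiter/newline
        isnull = isempty(s)
        return (isnull ? NaN : parse(Float64, s)), isnull
    end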
>>>>>> One of the most critical performance keys for both SQLite and
>>>>>> CSV.read is not copying strings once the file has been mmapped. For
>>>>>> SQLite, the sqlite3_bind_text library method actually has a flag to
>>>>>> indicate whether the text should be copied or not, so we're able to
>>>>>> pass the pointer to the position in the mmapped array directly. For the
>>>>>> CSV.read method, which returns a Vector of the columns (as typed
>>>>>> arrays), I've actually rolled a quick-and-dirty CString type that looks
>>>>>> like:
>>>>>>
>>>>>> immutable CString
>>>>>>     ptr::Ptr{UInt8}
>>>>>>     len::Int
>>>>>> end
>>>>>>
>>>>>> With a few extra method definitions (sketched below), this type looks
>>>>>> very close to a real string type, but we can construct it by pointing
>>>>>> directly to the mmapped region (which currently isn't possible for
>>>>>> native Julia string types). See https://github.com/quinnj/Strings.jl
>>>>>> for more brainstorming around this alternative string implementation.
>>>>>> You can convert a CString to a Julia string by calling
>>>>>> string(x::CString), or map(string, column) for an Array of CSV.CStrings.
>>>>>>
>>>>>> As an update on the performance on the Facebook Kaggle competition
>>>>>> bids.csv file:
>>>>>>
>>>>>> - readcsv: 45 seconds, 33% gc time
>>>>>> - CSV.read: 19 seconds, 3% gc time
>>>>>> - SQLite.create: 25 seconds, 3.25% gc time
>>>>>>
>>>>>> Anyway, hopefully I'll get around to cleaning up CSV.jl for an
>>>>>> official release, but it's that last 10-20% that's always the hardest
>>>>>> to finish up :)
>>>>>>
>>>>>> -Jacob
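For a sense of the "few extra method definitions" that make a
pointer-plus-length type behave like a string, here is a rough sketch, again
in Julia 0.4 syntax. This is an illustration, not the actual CSV.jl or
Strings.jl code, and it assumes ASCII data for brevity:

    immutable CString
        ptr::Ptr{UInt8}   # points directly into the mmapped region
        len::Int
    end

    Base.length(s::CString) = s.len
    # Byte-indexed access (ASCII assumption; real code must handle UTF-8)
    Base.getindex(s::CString, i::Int) = Char(unsafe_load(s.ptr, i))
    # Copies the bytes into a real Julia string, as string(x::CString) above
    Base.string(s::CString) = bytestring(s.ptr, s.len)
    Base.show(io::IO, s::CString) = print(io, string(s))

Constructing one is just pointer arithmetic over the mmapped buffer, e.g.
CString(pointer(buf, field_start), field_len); no bytes are copied until
string() is called.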
>>>>>>
>>>>>> On Mon, Jun 1, 2015 at 4:25 PM, David Gold <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> @Jacob I'm just developing a working understanding of these issues.
>>>>>>> Would you please help me get a better handle on your solution?
>>>>>>>
>>>>>>> My understanding thus far: reading a (local) .csv file into a
>>>>>>> DataFrame using DataFrames.readtable involves reading the file into an
>>>>>>> IOStream and then parsing that stream into a form amenable to
>>>>>>> DataFrames.builddf, which builds the DataFrame object returned by
>>>>>>> readtable. Getting the contents of the .csv file into memory in a form
>>>>>>> that Julia functions can manipulate is labor-intensive done this way.
>>>>>>> However, with SQLite, the entire file can just be thrown into memory
>>>>>>> wholesale, along with some metadata (maybe not the right term?) that
>>>>>>> delineates the tabular properties of the data.
>>>>>>>
>>>>>>> What I am curious about, then (if this understanding is not too
>>>>>>> misguided), is how SQLite returns, say, a column of data that doesn't
>>>>>>> include, say, a bunch of delimiters. That is, what sort of parsing
>>>>>>> *does* SQLite do, and when?
>>>>>>>
>>>>>>> On Monday, June 1, 2015 at 1:48:16 PM UTC-4, Jacob Quinn wrote:
>>>>>>>>
>>>>>>>> The biggest single advantage SQLite has is the ability to mmap a
>>>>>>>> file and just tell SQLite which pointer addresses start strings and
>>>>>>>> how long they are, all without copying. The huge, huge bottleneck in
>>>>>>>> most implementations is not just identifying where a string starts
>>>>>>>> and how long it is, but then allocating "in program" memory and
>>>>>>>> copying the string into it. With SQLite, we can use an in-memory
>>>>>>>> database, mmap the file, and tell SQLite where each string for a
>>>>>>>> column lives by giving it the starting pointer address and how long
>>>>>>>> it is. I've been looking into how to solve this problem over the last
>>>>>>>> month or so (apart from Oscar's gc wizardry), and it just occurred to
>>>>>>>> me last week that using SQLite may be the best way; so far, the
>>>>>>>> results are promising!
>>>>>>>>
>>>>>>>> -Jacob
>>>>>>>>
>>>>>>>> On Mon, Jun 1, 2015 at 11:40 AM, <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Great, thank you Jacob, I will try it out!
>>>>>>>>>
>>>>>>>>> Do you have a writeup on the differences between the way you read
>>>>>>>>> CSV files and the way it is currently done in Julia? Would love to
>>>>>>>>> know more!
>>>>>>>>>
>>>>>>>>> Obvious perhaps, but for completeness: reading the data using
>>>>>>>>> readcsv or readdlm does not much improve the metrics I reported,
>>>>>>>>> suggesting that the overhead from DataFrames is not large.
>>>>>>>>>
>>>>>>>>> Thank you again!
>>>>>>>>>
>>>>>>>>> On Monday, June 1, 2015 at 1:06:50 PM UTC-4, Jacob Quinn wrote:
>>>>>>>>>>
>>>>>>>>>> I've been meaning to clean some things up and properly release
>>>>>>>>>> the functionality, but I have a new way to read in CSV files that
>>>>>>>>>> beats anything else out there that I know of. To get the
>>>>>>>>>> functionality, you'll need to be running 0.4 master, then do
>>>>>>>>>>
>>>>>>>>>> Pkg.add("SQLite")
>>>>>>>>>> Pkg.checkout("SQLite", "jq/updates")
>>>>>>>>>> Pkg.clone("https://github.com/quinnj/CSV.jl")
>>>>>>>>>> Pkg.clone("https://github.com/quinnj/Mmap.jl")
>>>>>>>>>>
>>>>>>>>>> I then ran the following on the bids.csv file:
>>>>>>>>>>
>>>>>>>>>> using SQLite, CSV
>>>>>>>>>>
>>>>>>>>>> db = SQLite.SQLiteDB()
>>>>>>>>>> ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv")
>>>>>>>>>> @time lines = SQLite.create(db, ff, "temp2")
>>>>>>>>>>
>>>>>>>>>> It took 18 seconds on my newish MBP. From the R data.table
>>>>>>>>>> package, `fread` is the other fastest CSV reader I know of, and it
>>>>>>>>>> took 34 seconds on my machine. I'm actually pretty surprised by
>>>>>>>>>> that, since in other tests I've done it was on par with SQLite+CSV
>>>>>>>>>> or sometimes slightly faster.
>>>>>>>>>>
>>>>>>>>>> Now, you're not necessarily getting a Julia structure in this
>>>>>>>>>> case, but it's loading the data into an SQLite table that you can
>>>>>>>>>> then query with SQLite.query(db, sql_string) to do manipulations
>>>>>>>>>> and such.
>>>>>>>>>>
>>>>>>>>>> -Jacob
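Once the data is loaded, the SQLite.query call mentioned above can do the
manipulation inside SQLite. A small usage sketch: the COUNT query should work
against the "temp2" table created above, while the bidder_id column name is an
assumption about bids.csv, not something stated in the thread:

    res = SQLite.query(db, "SELECT COUNT(*) FROM temp2")

    # A grouped aggregation pushed down to SQLite (bidder_id is assumed):
    res = SQLite.query(db, "SELECT bidder_id, COUNT(*) AS n
                            FROM temp2 GROUP BY bidder_id LIMIT 10")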
>>>>>>>>>>
>>>>>>>>>> On Sun, May 31, 2015 at 9:42 PM, <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you Tim and Jiahao for your responses. Sorry, I did not
>>>>>>>>>>> mention in my OP that I was using Version 0.3.10-pre+1
>>>>>>>>>>> (2015-05-30 11:26 UTC) Commit 80dd75c* (1 day old release-0.3).
>>>>>>>>>>>
>>>>>>>>>>> I tried other releases as Tim suggested:
>>>>>>>>>>>
>>>>>>>>>>> On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648*
>>>>>>>>>>> (0 days old master), the same command takes 14 minutes: half what
>>>>>>>>>>> it was taking with release-0.3, but still 3 times longer than R's
>>>>>>>>>>> read.csv (5 min). More important, the Julia process takes up 8GB
>>>>>>>>>>> of memory (the R session takes 1.6GB). The output of the command
>>>>>>>>>>> `@time DataFrames.readtable("bids.csv");` is
>>>>>>>>>>> 857.120 seconds (352 M allocations: 16601 MB, 71.59% gc time)
>>>>>>>>>>> # gc time reduced from 85% to 71%
>>>>>>>>>>>
>>>>>>>>>>> For completeness, on Version 0.4.0-dev+4451 (2015-04-22 21:55
>>>>>>>>>>> UTC) ob/gctune/238ed08* (fork: 1 commit, 39 days), the command
>>>>>>>>>>> `@time DataFrames.readtable("bids.csv");` takes 21 minutes; the
>>>>>>>>>>> output of the macro is:
>>>>>>>>>>> elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58%
>>>>>>>>>>> gc time in 33 pauses with 31 full sweep)
>>>>>>>>>>> The process also takes up 8GB of memory on the machine, more than
>>>>>>>>>>> the earlier one. My machine has also slowed down significantly,
>>>>>>>>>>> so perhaps the increase in memory compared to release-0.3 is
>>>>>>>>>>> significant.
>>>>>>>>>>>
>>>>>>>>>>> On disabling gc, my machine (4GB laptop) goes soul-searching, so
>>>>>>>>>>> it's not an option for now.
>>>>>>>>>>>
>>>>>>>>>>> Is this the best one can expect for now? I read the discussion
>>>>>>>>>>> on issue #10428 but I did not understand it well :-(
>>>>>>>>>>>
>>>>>>>>>>> Thank you!
>>>>>>>>>>>
>>>>>>>>>>> On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Not ideal, but for now you can try turning off garbage
>>>>>>>>>>>> collection while reading in the DataFrame:
>>>>>>>>>>>>
>>>>>>>>>>>> gc_disable()
>>>>>>>>>>>> df = DataFrames.readtable("bids.csv")
>>>>>>>>>>>> gc_enable()
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Jiahao Chen
>>>>>>>>>>>> Research Scientist
>>>>>>>>>>>> MIT CSAIL
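A slightly more defensive variant of the same workaround (a suggestion added
here, not from the thread) re-enables gc even if readtable throws:

    gc_disable()
    df = try
        DataFrames.readtable("bids.csv")
    finally
        gc_enable()   # runs even if readtable errors
    end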
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If you're using julia 0.3, you might want to try current
>>>>>>>>>>>>> master and/or possibly the "ob/gctune" branch.
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/JuliaLang/julia/issues/10428
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> --Tim
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote:
>>>>>>>>>>>>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows
>>>>>>>>>>>>> > and 9 columns (mostly strings):
>>>>>>>>>>>>> > https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Loading the dataset in R using read.csv takes 5 minutes, and
>>>>>>>>>>>>> > the resulting dataframe takes 0.6GB (RStudio takes a total of
>>>>>>>>>>>>> > 1.6GB of memory on my machine):
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > > t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
>>>>>>>>>>>>> >    user  system elapsed
>>>>>>>>>>>>> > 332.295   4.154 343.332
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > > object.size(a)
>>>>>>>>>>>>> > 601496056 bytes  # (0.6 GB)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Loading the same dataset using DataFrames' readtable takes
>>>>>>>>>>>>> > about 30 minutes on the same machine (it varies a bit; the
>>>>>>>>>>>>> > lowest is 25 minutes), and the Julia process (REPL in a
>>>>>>>>>>>>> > terminal) takes 6GB of memory on the same machine.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > (I added a couple of calls to the @time macro inside the
>>>>>>>>>>>>> > readtable function to see what's taking time; the outcomes of
>>>>>>>>>>>>> > these calls are below too.)
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > julia> @time DataFrames.readtable("bids.csv");
>>>>>>>>>>>>> > WARNING: Begin readnrows call
>>>>>>>>>>>>> > elapsed time: 29.517358476 seconds (2315258744 bytes
>>>>>>>>>>>>> > allocated, 0.35% gc time)
>>>>>>>>>>>>> > WARNING: End readnrows call
>>>>>>>>>>>>> > WARNING: Begin builddf call
>>>>>>>>>>>>> > elapsed time: 1809.506275842 seconds (18509704816 bytes
>>>>>>>>>>>>> > allocated, 85.54% gc time)
>>>>>>>>>>>>> > WARNING: End builddf call
>>>>>>>>>>>>> > elapsed time: 1840.471467982 seconds (21808681500 bytes
>>>>>>>>>>>>> > allocated, 84.12% gc time)  # total time for loading
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Can you please suggest how I can improve load time and memory
>>>>>>>>>>>>> > usage in DataFrames for sizes this big and bigger?
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Thank you!
