@David, Sorry for the slow response. It's been a busy week :)
Here's a quick rundown of the approach: - In the still-yet-to-be-officially-published https://github.com/quinnj/CSV.jl package, the bulk of the code goes into creating a `CSV.File` type where the structure/metadata of the file is parsed/detected/saved in a type (e.g. header, delimiter, newline, # of columns, detected column types, etc.) - `SQLite.create` and now `CSV.read` both take a `CSV.File` as input and follow a similar process in parsing: - The actual file contents are mmapped; i.e. the entire file is loaded into memory at once - There are currently three `readfield` methods (Int,Float64,String) that take an open `CSV.Stream` type (which holds the mmapped data and the current "position" of parsing), and read a single field according to what the type of that column is supposed to be - for example, readfield(io::CSV.Stream, ::Type{Float64}, row, col), will start reading at the current position of the `CSV.Stream` until it hits the next delimiter, newline, or end of the file and then interpret the contents as a Float64, returning `val, isnull` That's pretty much it. One of the most critical performance keys for both SQLite and CSV.read is non-copying strings once the file has been mmapped. For SQLite, the sqlite3_bind_text library method actually has a flag to indicate whether the text should be copied or not, so we're able to pass the pointer to the position in the mmapped array directly. For the CSV.read method, which returns a Vector of the columns (as typed arrays), I've actually rolled a quick and dirty CString type that looks like immutable CString ptr::Ptr{UInt8} len::Int end With a few extra method definitions, this type looks very close to a real string type, but we can construct it by pointing directly to the mmapped region (which currently isn't possible for native Julia string types). See https://github.com/quinnj/Strings.jl for more brainstorming around this alternative string implementation. You can convert a CString to a Julia string by calling string(x::CString) or map(string,column) for an Array of CSV.CStrings. As an update on the performance on the Facebook Kaggle competition bids.csv file: -readcsv: 45 seconds, 33% gc time -CSV.read: 19 seconds, 3% gc time -SQLite.create: 25 seconds, 3.25% gc time Anyway, hopefully I'll get around to cleaning up CSV.jl to be released officially, but it's that last 10-20% that's always the hardest to finish up :) -Jacob On Mon, Jun 1, 2015 at 4:25 PM, David Gold <[email protected]> wrote: > @Jacob I'm just developing a working understanding of these issues. Would > you please help me to get a better handle on your solution? > > My understanding thus far: Reading a (local) .csv file into a DataFrame > using DataFrames.readtable involves reading the file into an IOStream and > then parsing that stream into a form amenable to parsing by > DataFrames.builddf, which builds the DataFrame object returned by > readtable. The work required to get the contents of the .csv file into > memory in a form that can be manipulated by Julia functions is > work-intensive in this manner. However, with SQLite, the entire file can > just be thrown into memory wholesale, along with some metadata (maybe not > the right term?) that delineates the tabular properties of the data. > > What I am curious about, then (if this understanding is not too > misguided), is how SQLite returns, say, a column of data that doesn't > include, say, a bunch of delimiters. That is, what sort of parsing *does* > SQLite do, and when? > > On Monday, June 1, 2015 at 1:48:16 PM UTC-4, Jacob Quinn wrote: >> >> The biggest single advantage SQLite has is the ability to mmap a file and >> just tell SQLite which pointer addresses start strings and how long they >> are, all without copying. The huge, huge bottleneck in most >> implementations, is not just identifying where a string starts and how long >> it is, but then allocating "in program" memory and copying the string into >> it. With SQLite, we can use an in-memory database, mmap the file, and tell >> SQLite where each string for a column lives by giving it the starting >> pointer address and how long it is. I've been looking into how to solve >> this problem over the last month or so (apart from Oscar's gc wizardry) and >> it just occurred to me last week that using SQLite may be the best way; so >> far, the results are promising! >> >> -Jacob >> >> On Mon, Jun 1, 2015 at 11:40 AM, <[email protected]> wrote: >> >>> Great, thank you Jacob, I will try it out! >>> >>> Do you have a writeup on differences in the way you read CSV files and >>> the way it is currently done in Julia? Would love to know more! >>> >>> Obvious perhaps but for completeness: Reading the data using readcsv or >>> readdlm does not improve much the metrics I reported, suggesting that the >>> overhead from DataFrames is not much. >>> >>> Thank you again! >>> >>> On Monday, June 1, 2015 at 1:06:50 PM UTC-4, Jacob Quinn wrote: >>>> >>>> I've been meaning to clean some things up and properly release the >>>> functionality, but I have a new way to read in CSV files that beats >>>> anything else out there that I know of. To get the functionality, you'll >>>> need to be running 0.4 master, then do >>>> >>>> Pkg.add("SQLite") >>>> Pkg.checkout("SQLite","jq/updates") >>>> Pkg.clone("https://github.com/quinnj/CSV.jl") >>>> Pkg.clone("https://github.com/quinnj/Mmap.jl") >>>> >>>> I then ran the following on the bids.csv file >>>> >>>> using SQLite, CSV >>>> >>>> db = SQLite.SQLiteDB() >>>> >>>> ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv") >>>> >>>> @time lines = SQLite.create(db, ff,"temp2") >>>> >>>> It took 18 seconds on my newish MBP. From the R data.table package, the >>>> `fread` is the other fastest CSV I know of and it took 34 seconds on my >>>> machine. I'm actually pretty surprised by that, since in other tests I've >>>> done it was on par with the SQLite+CSV or sometimes slightly faster. >>>> >>>> Now, you're not necessarily getting a Julia structure in this case, but >>>> it's loading the data into an SQLite table, that you can then run >>>> SQLite.query(db, sql_string) to do manipulations and such. >>>> >>>> -Jacob >>>> >>>> >>>> On Sun, May 31, 2015 at 9:42 PM, <[email protected]> wrote: >>>> >>>>> Thank you Tim and Jiahao for your responses. Sorry, I did not mention >>>>> in my OP that I was using Version 0.3.10-pre+1 (2015-05-30 11:26 UTC) >>>>> Commit 80dd75c* (1 day old release-0.3). >>>>> >>>>> I tried other releases as Tim suggested: >>>>> >>>>> On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648* (0 >>>>> days old master), >>>>> the same command takes 14 minutes - half that it was taking with >>>>> release-0.3 but still 3 times more than that taken by R's read.csv (5 >>>>> min). >>>>> More important, Julia process takes up 8GB memory (Rsession takes 1.6GB) >>>>> output of the command `@time DataFrames.readtable("bids.csv");` is >>>>> 857.120 seconds (352 M allocations: 16601 MB, 71.59% gc time) # >>>>> reduced from 85% to 71% >>>>> >>>>> For completeness, On Version 0.4.0-dev+4451 (2015-04-22 21:55 UTC) >>>>> ob/gctune/238ed08* (fork: 1 commits, 39 days), the command `@time >>>>> DataFrames.readtable("bids.csv");` takes 21 minutes; the output of the >>>>> macro is: >>>>> elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58% gc >>>>> time in 33 pauses with 31 full sweep) >>>>> The process also takes up 8GB memory on the machine, more than the >>>>> earlier one. My machine has also significantly slowed down - so perhaps >>>>> the >>>>> increase in memory when compared to release-0.3 is significant. >>>>> >>>>> On disabling gc, my machine (4GB laptop) goes soul searching; so its >>>>> not an option for now. >>>>> >>>>> Is this the best one can expect for now? I read the discussion on >>>>> issue #10428 but I did not understand it well :-( >>>>> >>>>> Thank you! >>>>> >>>>> >>>>> >>>>> On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote: >>>>>> >>>>>> Not ideal, but for now you can try turning off the garbage collection >>>>>> while reading in the DataFrame. >>>>>> >>>>>> gc_disable() >>>>>> df = DataFrames.readtable("bids.csv") >>>>>> gc_enable() >>>>>> >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Jiahao Chen >>>>>> Research Scientist >>>>>> MIT CSAIL >>>>>> >>>>>> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <[email protected]> wrote: >>>>>> >>>>>>> If you're using julia 0.3, you might want to try current master >>>>>>> and/or >>>>>>> possibly the "ob/gctune" branch. >>>>>>> >>>>>>> https://github.com/JuliaLang/julia/issues/10428 >>>>>>> >>>>>>> Best, >>>>>>> --Tim >>>>>>> >>>>>>> On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote: >>>>>>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows with >>>>>>> 9 columns >>>>>>> > (mostly >>>>>>> > strings). >>>>>>> https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data >>>>>>> > >>>>>>> > Loading the dataset in R using read.csv takes 5 minutes and the >>>>>>> resulting >>>>>>> > dataframe takes 0.6GB (RStudio takes a total of 1.6GB memory on my >>>>>>> machine) >>>>>>> > >>>>>>> > >t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0 >>>>>>> > >>>>>>> > user system elapsed >>>>>>> > 332.295 4.154 343.332 >>>>>>> > >>>>>>> > > object.size(a) >>>>>>> > >>>>>>> > 601496056 bytes #(0.6 GB) >>>>>>> > >>>>>>> > Loading the same dataset using DataFrames' readtable takes about >>>>>>> 30 minutes >>>>>>> > on the same machine (varies a bit, lowest is 25 minutes) and the >>>>>>> resulting >>>>>>> > (Julia process, REPL on Terminal, takes 6GB memory on the same >>>>>>> machine) >>>>>>> > >>>>>>> > (I added couple of calls to @time macro inside the readtable >>>>>>> function to >>>>>>> > see whats taking time - outcomes of these calls too are below) >>>>>>> > >>>>>>> > julia> @time DataFrames.readtable("bids.csv"); >>>>>>> > WARNING: Begin readnrows call >>>>>>> > elapsed time: 29.517358476 seconds (2315258744 bytes allocated, >>>>>>> 0.35% gc >>>>>>> > time) >>>>>>> > WARNING: End readnrows call >>>>>>> > WARNING: Begin builddf call >>>>>>> > elapsed time: 1809.506275842 seconds (18509704816 bytes >>>>>>> allocated, 85.54% >>>>>>> > gc time) >>>>>>> > WARNING: End builddf call >>>>>>> > elapsed time: 1840.471467982 seconds (21808681500 bytes allocated, >>>>>>> 84.12% >>>>>>> > gc time) #total time for loading >>>>>>> > >>>>>>> > >>>>>>> > Can you please suggest how I can improve load time and memory >>>>>>> usage in >>>>>>> > DataFrames for sizes this big and bigger? >>>>>>> > >>>>>>> > Thank you! >>>>>>> >>>>>>> >>>>>> >>>> >>
