I agree; I would think this could be done nicely in pure Julia, just using those techniques that make SQLite fast...
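Something like this minimal sketch of the zero-copy idea, written in current Julia syntax (the 0.4-era Mmap package discussed below differed, and field_ranges is a made-up helper): the file is mapped once, parsing records only each field's byte range, and a String is allocated only when a field is actually requested.

using Mmap

# Record only the (start, stop) byte range of each field on the first
# row; nothing is copied here. (Sketch only: no quoting, first row only.)
function field_ranges(buf::Vector{UInt8})
    ranges = UnitRange{Int}[]
    lo = 1
    for i in eachindex(buf)
        b = buf[i]
        if b == UInt8(',') || b == UInt8('\n')
            push!(ranges, lo:i-1)
            lo = i + 1
            b == UInt8('\n') && break
        end
    end
    return ranges
end

# Map the file once; buf is a Vector{UInt8} backed directly by the file.
buf = Mmap.mmap("bids.csv")
ranges = field_ranges(buf)
first_field = String(@view buf[ranges[1]])  # the only copy, made on demand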
On Monday, June 1, 2015 at 9:01:59 PM UTC+2, verylucky Man wrote:
>
> Thank you Jacob for the detailed explanation!
> Why can't one do something similar with Julia structures (instead of
> SQLite)? Sorry for asking what may be very basic questions.
>
> Thank you!
>
> On Mon, Jun 1, 2015 at 1:47 PM, Jacob Quinn <[email protected]> wrote:
>
>> The biggest single advantage SQLite has is the ability to mmap a file and
>> just tell SQLite which pointer addresses start strings and how long they
>> are, all without copying. The huge, huge bottleneck in most
>> implementations is not just identifying where a string starts and how
>> long it is, but then allocating "in program" memory and copying the
>> string into it. With SQLite, we can use an in-memory database, mmap the
>> file, and tell SQLite where each string for a column lives by giving it
>> the starting pointer address and how long it is. I've been looking into
>> how to solve this problem over the last month or so (apart from Oscar's
>> gc wizardry), and it just occurred to me last week that using SQLite may
>> be the best way; so far, the results are promising!
>>
>> -Jacob
>>
>> On Mon, Jun 1, 2015 at 11:40 AM, <[email protected]> wrote:
>>
>>> Great, thank you Jacob, I will try it out!
>>>
>>> Do you have a writeup on the differences between the way you read CSV
>>> files and the way it is currently done in Julia? I would love to know
>>> more!
>>>
>>> Obvious perhaps, but for completeness: reading the data using readcsv
>>> or readdlm does not improve the metrics I reported much, suggesting
>>> that the overhead from DataFrames is small.
>>>
>>> Thank you again!
>>>
>>> On Monday, June 1, 2015 at 1:06:50 PM UTC-4, Jacob Quinn wrote:
>>>>
>>>> I've been meaning to clean some things up and properly release the
>>>> functionality, but I have a new way to read in CSV files that beats
>>>> anything else out there that I know of. To get the functionality,
>>>> you'll need to be running 0.4 master, then do
>>>>
>>>> Pkg.add("SQLite")
>>>> Pkg.checkout("SQLite","jq/updates")
>>>> Pkg.clone("https://github.com/quinnj/CSV.jl")
>>>> Pkg.clone("https://github.com/quinnj/Mmap.jl")
>>>>
>>>> I then ran the following on the bids.csv file:
>>>>
>>>> using SQLite, CSV
>>>>
>>>> db = SQLite.SQLiteDB()
>>>>
>>>> ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv")
>>>>
>>>> @time lines = SQLite.create(db, ff, "temp2")
>>>>
>>>> It took 18 seconds on my newish MBP. R's data.table package provides
>>>> `fread`, the next fastest CSV reader I know of, and it took 34 seconds
>>>> on my machine. I'm actually pretty surprised by that, since in other
>>>> tests I've done it was on par with SQLite+CSV or sometimes slightly
>>>> faster.
>>>>
>>>> Now, you're not necessarily getting a Julia structure in this case,
>>>> but it loads the data into an SQLite table, on which you can then run
>>>> SQLite.query(db, sql_string) to do manipulations and such.
>>>>
>>>> -Jacob
>>>>
>>>> On Sun, May 31, 2015 at 9:42 PM, <[email protected]> wrote:
>>>>
>>>>> Thank you Tim and Jiahao for your responses. Sorry, I did not mention
>>>>> in my OP that I was using Version 0.3.10-pre+1 (2015-05-30 11:26 UTC)
>>>>> Commit 80dd75c* (1 day old release-0.3).
>>>>>
>>>>> I tried other releases as Tim suggested:
>>>>>
>>>>> On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648* (0
>>>>> days old master), the same command takes 14 minutes - half what it
>>>>> took with release-0.3, but still 3 times more than R's read.csv
>>>>> (5 min).
>>>>> More important, the Julia process takes up 8GB of memory (the R
>>>>> session takes 1.6GB). The output of the command `@time
>>>>> DataFrames.readtable("bids.csv");` is:
>>>>> 857.120 seconds (352 M allocations: 16601 MB, 71.59% gc time)
>>>>> # gc time reduced from 85% to 71%
>>>>>
>>>>> For completeness, on Version 0.4.0-dev+4451 (2015-04-22 21:55 UTC)
>>>>> ob/gctune/238ed08* (fork: 1 commits, 39 days), the command `@time
>>>>> DataFrames.readtable("bids.csv");` takes 21 minutes; the output of
>>>>> the macro is:
>>>>> elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58% gc
>>>>> time in 33 pauses with 31 full sweep)
>>>>> The process also takes up 8GB of memory on the machine, more than the
>>>>> earlier one. My machine has also slowed down significantly, so
>>>>> perhaps the increase in memory relative to release-0.3 is significant.
>>>>>
>>>>> On disabling gc, my machine (a 4GB laptop) goes soul searching, so
>>>>> it's not an option for now.
>>>>>
>>>>> Is this the best one can expect for now? I read the discussion on
>>>>> issue #10428 but I did not understand it well :-(
>>>>>
>>>>> Thank you!
>>>>>
>>>>> On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote:
>>>>>>
>>>>>> Not ideal, but for now you can try turning off garbage collection
>>>>>> while reading in the DataFrame:
>>>>>>
>>>>>> gc_disable()
>>>>>> df = DataFrames.readtable("bids.csv")
>>>>>> gc_enable()
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jiahao Chen
>>>>>> Research Scientist
>>>>>> MIT CSAIL
>>>>>>
>>>>>> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <[email protected]> wrote:
>>>>>>
>>>>>>> If you're using Julia 0.3, you might want to try current master
>>>>>>> and/or possibly the "ob/gctune" branch.
>>>>>>>
>>>>>>> https://github.com/JuliaLang/julia/issues/10428
>>>>>>>
>>>>>>> Best,
>>>>>>> --Tim
>>>>>>>
>>>>>>> On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote:
>>>>>>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows and
>>>>>>> > 9 columns (mostly strings):
>>>>>>> > https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>>>>>>> >
>>>>>>> > Loading the dataset in R using read.csv takes 5 minutes, and the
>>>>>>> > resulting dataframe takes 0.6GB (RStudio takes a total of 1.6GB
>>>>>>> > of memory on my machine):
>>>>>>> >
>>>>>>> > > t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
>>>>>>> >
>>>>>>> >    user  system elapsed
>>>>>>> > 332.295   4.154 343.332
>>>>>>> >
>>>>>>> > > object.size(a)
>>>>>>> >
>>>>>>> > 601496056 bytes # (0.6 GB)
>>>>>>> >
>>>>>>> > Loading the same dataset using DataFrames' readtable takes about
>>>>>>> > 30 minutes on the same machine (it varies a bit; the lowest is 25
>>>>>>> > minutes), and the resulting Julia process (REPL in Terminal)
>>>>>>> > takes 6GB of memory on the same machine.
>>>>>>> >
>>>>>>> > (I added a couple of calls to the @time macro inside the
>>>>>>> > readtable function to see what's taking time; the outcomes of
>>>>>>> > those calls are below.)
>>>>>>> >
>>>>>>> > julia> @time DataFrames.readtable("bids.csv");
>>>>>>> > WARNING: Begin readnrows call
>>>>>>> > elapsed time: 29.517358476 seconds (2315258744 bytes allocated,
>>>>>>> > 0.35% gc time)
>>>>>>> > WARNING: End readnrows call
>>>>>>> > WARNING: Begin builddf call
>>>>>>> > elapsed time: 1809.506275842 seconds (18509704816 bytes
>>>>>>> > allocated, 85.54% gc time)
>>>>>>> > WARNING: End builddf call
>>>>>>> > elapsed time: 1840.471467982 seconds (21808681500 bytes
>>>>>>> > allocated, 84.12% gc time) # total time for loading
>>>>>>> >
>>>>>>> > Can you please suggest how I can improve load time and memory
>>>>>>> > usage in DataFrames for sizes this big and bigger?
>>>>>>> >
>>>>>>> > Thank you!
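For reference, here is Jacob's workflow end to end as a short sketch, using only the calls quoted above (the 2015-era jq/updates branch of SQLite.jl; today's SQLite.jl API differs). The table name "bids" and the query string are placeholders:

using SQLite, CSV

# Load the CSV through mmap into an in-memory SQLite table
# (2015-era jq/updates API, as quoted in the thread above).
db = SQLite.SQLiteDB()          # in-memory database, per the thread
ff = CSV.File("bids.csv")       # mmaps the file and locates each field
SQLite.create(db, ff, "bids")   # hands SQLite pointers; no string copies

# Manipulations then run in SQL rather than on a Julia structure;
# the query below is a placeholder.
res = SQLite.query(db, "SELECT COUNT(*) FROM bids")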
