I agree, I would think this could be done nicely in pure Julia, just using 
those techniques that make SQLite fast...

On Monday, June 1, 2015 at 9:01:59 PM UTC+2, verylucky Man wrote:
>
> Thank you, Jacob, for the detailed explanation!
> Why can't one do something similar with Julia structures (instead of 
> SQLite)? Sorry for asking what may be very basic questions.
>
> Thank you!
>
> On Mon, Jun 1, 2015 at 1:47 PM, Jacob Quinn <[email protected]> wrote:
>
>> The biggest single advantage SQLite has is the ability to mmap a file and 
>> just tell SQLite which pointer addresses start strings and how long they 
>> are, all without copying. The huge bottleneck in most implementations is 
>> not just identifying where a string starts and how long it is, but then 
>> allocating "in program" memory and copying the string into it. With 
>> SQLite, we can use an in-memory database, mmap the file, and tell SQLite 
>> where each string for a column lives by giving it the starting pointer 
>> address and its length. I've been looking into how to solve this problem 
>> over the last month or so (apart from Oscar's gc wizardry), and it just 
>> occurred to me last week that using SQLite may be the best way; so far, 
>> the results are promising!
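>>
>> Roughly, the trick looks like this (a minimal sketch in current Julia 
>> syntax, not the actual SQLite.jl/CSV.jl internals; the file name is just 
>> an example):
>>
>> using Mmap
>>
>> io = open("bids.csv", "r")
>> buf = Mmap.mmap(io)   # Vector{UInt8} backed by the file; nothing is copied
>>
>> # Locate the end of the first comma-separated field.
>> stop = findfirst(==(UInt8(',')), buf)
>>
>> # A (pointer, length) pair is all SQLite needs to bind the string:
>> ptr = pointer(buf, 1)
>> len = stop - 1
>>
>> # Materializing a Julia String is exactly the copy this approach avoids:
>> field = unsafe_string(ptr, len)
>> close(io)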
>>
>> -Jacob
>>
>> On Mon, Jun 1, 2015 at 11:40 AM, <[email protected]> wrote:
>>
>>> Great, thank you Jacob, I will try it out! 
>>>
>>> Do you have a writeup on the differences between the way you read CSV 
>>> files and the way it is currently done in Julia? I would love to know more!
>>>
>>> Obvious perhaps, but for completeness: reading the data using readcsv or 
>>> readdlm does not much improve the metrics I reported, which suggests the 
>>> overhead from DataFrames itself is small.
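>>>
>>> (By readcsv/readdlm I mean the Base functions, timed like so, with the 
>>> same file as before:)
>>>
>>> @time readcsv("bids.csv")
>>> @time readdlm("bids.csv", ',')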
>>>
>>> Thank you again!
>>>
>>> On Monday, June 1, 2015 at 1:06:50 PM UTC-4, Jacob Quinn wrote:
>>>>
>>>> I've been meaning to clean some things up and properly release the 
>>>> functionality, but I have a new way to read in CSV files that beats 
>>>> anything else out there that I know of. To try it, you'll need to be 
>>>> running 0.4 master, then do:
>>>>
>>>> Pkg.add("SQLite")
>>>> Pkg.checkout("SQLite","jq/updates")
>>>> Pkg.clone("https://github.com/quinnj/CSV.jl";)
>>>> Pkg.clone("https://github.com/quinnj/Mmap.jl";)
>>>>
>>>> I then ran the following on the bids.csv file
>>>>
>>>> using SQLite, CSV
>>>>
>>>> db = SQLite.SQLiteDB()
>>>>
>>>> ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv")
>>>>
>>>> @time lines = SQLite.create(db, ff,"temp2")
>>>>
>>>> It took 18 seconds on my newish MBP. The fastest alternative I know of, 
>>>> `fread` from R's data.table package, took 34 seconds on my machine. I'm 
>>>> actually pretty surprised by that, since in other tests I've done it was 
>>>> on par with the SQLite+CSV approach or sometimes slightly faster.
>>>>
>>>> Now, you're not necessarily getting a Julia structure in this case, but 
>>>> the data is loaded into an SQLite table, which you can then query with 
>>>> SQLite.query(db, sql_string) to do manipulations and such.
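>>>>
>>>> For example (the bidder_id column here is a guess at the bids.csv 
>>>> schema, not something verified in this thread):
>>>>
>>>> # How many rows made it into the table:
>>>> SQLite.query(db, "SELECT COUNT(*) FROM temp2")
>>>>
>>>> # A hypothetical aggregation:
>>>> SQLite.query(db, "SELECT bidder_id, COUNT(*) FROM temp2 GROUP BY bidder_id")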
>>>>
>>>> -Jacob
>>>>
>>>>
>>>> On Sun, May 31, 2015 at 9:42 PM, <[email protected]> wrote:
>>>>
>>>>> Thank you Tim and Jiahao for your responses. Sorry, I did not mention 
>>>>> in my OP that I was using Version 0.3.10-pre+1 (2015-05-30 11:26 UTC) 
>>>>> Commit 80dd75c* (1 day old release-0.3).
>>>>>
>>>>> I tried other releases as Tim suggested:
>>>>>
>>>>> On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648* (0 
>>>>> days old master), the same command takes 14 minutes, half of what it 
>>>>> took with release-0.3 but still 3 times more than R's read.csv (5 min). 
>>>>> More importantly, the Julia process takes up 8GB of memory (the R 
>>>>> session takes 1.6GB).
>>>>> The output of `@time DataFrames.readtable("bids.csv");` is
>>>>> 857.120 seconds      (352 M allocations: 16601 MB, 71.59% gc time) # gc 
>>>>> time reduced from 85% to 71%
>>>>>
>>>>> For completeness, on Version 0.4.0-dev+4451 (2015-04-22 21:55 UTC) 
>>>>> ob/gctune/238ed08* (fork: 1 commits, 39 days), the command `@time 
>>>>> DataFrames.readtable("bids.csv");` takes 21 minutes; the output of the 
>>>>> macro is:
>>>>> elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58% gc 
>>>>> time in 33 pauses with 31 full sweep)
>>>>> The process also takes up 8GB of memory, more than the earlier run. My 
>>>>> machine has also slowed down significantly, so perhaps the increase in 
>>>>> memory compared to release-0.3 matters.
>>>>>
>>>>> On disabling gc, my machine (a 4GB laptop) goes soul searching, so it's 
>>>>> not an option for now.
>>>>>
>>>>> Is this the best one can expect for now? I read the discussion on 
>>>>> issue #10428 but I did not understand it well :-(
>>>>>
>>>>> Thank you!
>>>>>
>>>>>
>>>>>
>>>>> On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote:
>>>>>>
>>>>>> Not ideal, but for now you can try turning off the garbage collection 
>>>>>> while reading in the DataFrame.
>>>>>>
>>>>>> gc_disable()
>>>>>> df = DataFrames.readtable("bids.csv")
>>>>>> gc_enable()
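>>>>>>
>>>>>> (A safer variant, untested here: re-enable the GC even if readtable 
>>>>>> throws.)
>>>>>>
>>>>>> gc_disable()
>>>>>> try
>>>>>>     df = DataFrames.readtable("bids.csv")
>>>>>> finally
>>>>>>     gc_enable()
>>>>>> end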
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jiahao Chen
>>>>>> Research Scientist
>>>>>> MIT CSAIL
>>>>>>
>>>>>> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <[email protected]> wrote:
>>>>>>
>>>>>>> If you're using julia 0.3, you might want to try current master 
>>>>>>> and/or
>>>>>>> possibly the "ob/gctune" branch.
>>>>>>>
>>>>>>> https://github.com/JuliaLang/julia/issues/10428
>>>>>>>
>>>>>>> Best,
>>>>>>> --Tim
>>>>>>>
>>>>>>> On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote:
>>>>>>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows with 
>>>>>>> > 9 columns (mostly strings):
>>>>>>> > https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>>>>>>> >
>>>>>>> > Loading the dataset in R using read.csv takes 5 minutes and the 
>>>>>>> > resulting dataframe takes 0.6GB (RStudio takes a total of 1.6GB 
>>>>>>> > memory on my machine).
>>>>>>> >
>>>>>>> > > t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
>>>>>>> >
>>>>>>> >    user  system elapsed
>>>>>>> > 332.295   4.154 343.332
>>>>>>> >
>>>>>>> > > object.size(a)
>>>>>>> >
>>>>>>> > 601496056 bytes # (0.6 GB)
>>>>>>> >
>>>>>>> > Loading the same dataset using DataFrames' readtable takes about 
>>>>>>> > 30 minutes on the same machine (it varies a bit; the lowest is 25 
>>>>>>> > minutes), and the Julia process (a REPL in Terminal) takes 6GB of 
>>>>>>> > memory on the same machine.
>>>>>>> >
>>>>>>> > (I added a couple of calls to the @time macro inside the readtable 
>>>>>>> > function to see what's taking time; the outcomes of those calls are 
>>>>>>> > below.)
>>>>>>> >
>>>>>>> > julia> @time DataFrames.readtable("bids.csv");
>>>>>>> > WARNING: Begin readnrows call
>>>>>>> > elapsed time: 29.517358476 seconds (2315258744 bytes allocated, 0.35% gc time)
>>>>>>> > WARNING: End readnrows call
>>>>>>> > WARNING: Begin builddf call
>>>>>>> > elapsed time: 1809.506275842 seconds (18509704816 bytes allocated, 85.54% gc time)
>>>>>>> > WARNING: End builddf call
>>>>>>> > elapsed time: 1840.471467982 seconds (21808681500 bytes allocated, 84.12% gc time) # total time for loading
>>>>>>> >
>>>>>>> >
>>>>>>> > Can you please suggest how I can improve load time and memory 
>>>>>>> > usage in DataFrames for sizes this big and bigger?
>>>>>>> >
>>>>>>> > Thank you!
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>
>
