Great, thank you Jacob, I will try it out! 

Do you have a writeup on differences in the way you read CSV files and the 
way it is currently done in Julia? Would love to know more!

Obvious perhaps but for completeness: reading the data using readcsv or 
readdlm does not improve the metrics I reported by much, suggesting that 
the overhead from DataFrames itself is small.
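
For reference, the Base calls I timed were along these lines (a sketch; 
options left at their defaults):

# both return a plain matrix, bypassing DataFrames entirely
@time a = readcsv("bids.csv")
@time b = readdlm("bids.csv", ',')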

Thank you again!

On Monday, June 1, 2015 at 1:06:50 PM UTC-4, Jacob Quinn wrote:
>
> I've been meaning to clean some things up and properly release the 
> functionality, but I have a new way to read in CSV files that beats 
> anything else out there that I know of. To get the functionality, you'll 
> need to be running 0.4 master, then do
>
> Pkg.add("SQLite")
> Pkg.checkout("SQLite","jq/updates")
> Pkg.clone("https://github.com/quinnj/CSV.jl";)
> Pkg.clone("https://github.com/quinnj/Mmap.jl";)
>
> I then ran the following on the bids.csv file
>
> using SQLite, CSV
>
> db = SQLite.SQLiteDB()  # open an SQLite database handle
>
> ff = CSV.File("/Users/jacobquinn/Downloads/bids.csv")  # wrap the file for parsing
>
> @time lines = SQLite.create(db, ff, "temp2")  # parse the CSV into the table "temp2"
>
> It took 18 seconds on my newish MBP. From the R data.table package, 
> `fread` is the fastest other CSV reader I know of, and it took 34 seconds 
> on my machine. I'm actually pretty surprised by that, since in other tests 
> I've done it was on par with SQLite+CSV or sometimes slightly faster.
>
> Now, you're not necessarily getting a Julia structure in this case; rather, 
> the data is loaded into an SQLite table, which you can then query with 
> SQLite.query(db, sql_string) to do manipulations and such.
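>
> For example, something along these lines (the table name "temp2" comes 
> from the create call above, and the bidder_id column is my assumption 
> about the bids.csv schema):
>
> # count bids per bidder directly in SQLite (column name assumed)
> res = SQLite.query(db, "SELECT bidder_id, COUNT(*) AS nbids FROM temp2 GROUP BY bidder_id")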
>
> -Jacob
>
>
> On Sun, May 31, 2015 at 9:42 PM, <[email protected]> wrote:
>
>> Thank you Tim and Jiahao for your responses. Sorry, I did not mention in 
>> my OP that I was using Version 0.3.10-pre+1 (2015-05-30 11:26 UTC) Commit 
>> 80dd75c* (1 day old release-0.3).
>>
>> I tried other releases as Tim suggested:
>>
>> On Version 0.4.0-dev+5121 (2015-05-31 12:13 UTC) Commit bfa8648* (0 days 
>> old master), the same command takes 14 minutes - half what it took on 
>> release-0.3, but still about 3 times longer than R's read.csv (5 min). 
>> More importantly, the Julia process takes up 8GB of memory (the R session 
>> takes 1.6GB). The output of `@time DataFrames.readtable("bids.csv");` is
>> 857.120 seconds (352 M allocations: 16601 MB, 71.59% gc time) # gc time reduced from 85% to 71%
>>
>> For completeness, on Version 0.4.0-dev+4451 (2015-04-22 21:55 UTC) 
>> ob/gctune/238ed08* (fork: 1 commits, 39 days), the command `@time 
>> DataFrames.readtable("bids.csv");` takes 21 minutes; the output of the 
>> macro is: 
>> elapsed time: 1303.167204109 seconds (18703 MB allocated, 76.58% gc time 
>> in 33 pauses with 31 full sweep)
>> This process also takes up 8GB of memory, more than release-0.3 did, and 
>> my machine has also slowed down significantly - so perhaps the increase 
>> in memory compared to release-0.3 is significant.
>>
>> With gc disabled, my machine (a 4GB laptop) goes soul searching, so it's 
>> not an option for now.
>>
>> Is this the best one can expect for now? I read the discussion on issue 
>> #10428 but I did not understand it well :-(
>>
>> Thank you!
>>
>>
>>
>> On Sunday, May 31, 2015 at 9:25:14 PM UTC-4, Jiahao Chen wrote:
>>>
>>> Not ideal, but for now you can try turning off the garbage collection 
>>> while reading in the DataFrame.
>>>
>>> gc_disable()
>>> df = DataFrames.readtable("bids.csv")
>>> gc_enable()
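>>>
>>> A safer variant of the same idea (a sketch), so gc gets re-enabled even 
>>> if readtable throws:
>>>
>>> gc_disable()
>>> try
>>>     df = DataFrames.readtable("bids.csv")
>>> finally
>>>     gc_enable()  # runs even when readtable errors
>>> end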
>>>
>>>
>>> Thanks,
>>>
>>> Jiahao Chen
>>> Research Scientist
>>> MIT CSAIL
>>>
>>> On Mon, Jun 1, 2015 at 1:36 AM, Tim Holy <[email protected]> wrote:
>>>
>>>> If you're using julia 0.3, you might want to try current master and/or
>>>> possibly the "ob/gctune" branch.
>>>>
>>>> https://github.com/JuliaLang/julia/issues/10428
>>>>
>>>> Best,
>>>> --Tim
>>>>
>>>> On Sunday, May 31, 2015 09:50:03 AM [email protected] wrote:
>>>> > Facebook's Kaggle competition has a dataset with ~7.6e6 rows and 9 
>>>> > columns (mostly strings). 
>>>> > https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/data
>>>> >
>>>> > Loading the dataset in R using read.csv takes 5 minutes, and the 
>>>> > resulting dataframe takes 0.6GB (RStudio takes a total of 1.6GB of 
>>>> > memory on my machine).
>>>> >
>>>> > >t0 = proc.time(); a = read.csv("bids.csv"); proc.time()-t0
>>>> >
>>>> > user   system elapsed
>>>> > 332.295   4.154 343.332
>>>> >
>>>> > > object.size(a)
>>>> >
>>>> > 601496056 bytes #(0.6 GB)
>>>> >
>>>> > Loading the same dataset using DataFrames' readtable takes about 30 
>>>> > minutes on the same machine (it varies a bit; the lowest is 25 
>>>> > minutes), and the Julia process (REPL on Terminal) takes 6GB of 
>>>> > memory on the same machine.
>>>> >
>>>> > (I added a couple of calls to the @time macro inside the readtable 
>>>> > function to see what's taking time - the outcomes of these calls are 
>>>> > below.)
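>>>> >
>>>> > The instrumentation was just along these lines (a sketch; builddf is 
>>>> > one of readtable's internal helpers, and the arguments here are 
>>>> > placeholders):
>>>> >
>>>> > warn("Begin builddf call")
>>>> > @time df = builddf(args...)  # hypothetical stand-in for the real internal call
>>>> > warn("End builddf call")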
>>>> >
>>>> > julia> @time DataFrames.readtable("bids.csv");
>>>> > WARNING: Begin readnrows call
>>>> > elapsed time: 29.517358476 seconds (2315258744 bytes allocated, 0.35% gc time)
>>>> > WARNING: End readnrows call
>>>> > WARNING: Begin builddf call
>>>> > elapsed time: 1809.506275842 seconds (18509704816 bytes allocated, 85.54% gc time)
>>>> > WARNING: End builddf call
>>>> > elapsed time: 1840.471467982 seconds (21808681500 bytes allocated, 84.12% gc time) # total time for loading
>>>> >
>>>> >
>>>> > Can you please suggest how I can improve load time and memory usage in
>>>> > DataFrames for sizes this big and bigger?
>>>> >
>>>> > Thank you!
>>>>
>>>>
>>>
>
