The biggest speed tweak is to pass the colClasses argument to read.csv. I have a little function that reads the first N lines, guesses the column types based on that, and passes the result to read.csv to read the full file. This is much faster than the defaults.
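For illustration, a minimal sketch of such a helper. This is not the actual function mentioned above; the name read_csv_guessed and the default n = 100 are assumptions made for the example.

```r
# Hypothetical helper: guess column classes from the first n rows,
# then read the whole file with those classes fixed via colClasses.
read_csv_guessed <- function(file, n = 100) {
  # Read only a small sample; read.csv infers the types as usual here.
  sample_rows <- read.csv(file, nrows = n, stringsAsFactors = FALSE)
  # class() can return more than one value, so keep just the first.
  classes <- vapply(sample_rows, function(col) class(col)[1], character(1))
  # Re-read the full file, skipping per-column type detection.
  read.csv(file, colClasses = classes)
}
```

The guess is only as good as the sample: if a column looks like integer in the first n rows but holds decimals or text further down, the full read stops with an error, so n needs to be large enough to be representative.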
Also specifying a guess at number of rows (but I don't think that can be made generic), and specifying comment.char="". See:

http://www.biostat.jhsph.edu/~rpeng/docs/R-large-tables.html

I can't time the differences because I'm not on my normal machine at the moment. But looking forward to this.

On 7 October 2011 10:58, Matthew Dowle <[email protected]> wrote:
> Yes, single delimiter files too. Yes it should be faster than normal speed
> tweaks on read.table.
>
> One (very very basic) test so far has shown 4 times faster for a 7.5MB file
> on disk (5.5s down to 1.3s). The code and test is already in the package
> (so you can run that test now), see data.table:::read (3 colons), and the
> 2 source files :
> https://r-forge.r-project.org/scm/viewvc.php/pkg/R/read.R?view=markup&root=datatable
> https://r-forge.r-project.org/scm/viewvc.php/pkg/src/readfile.c?view=markup&root=datatable
>
> But, it doesn't look like I did the speed tweaks for read.csv in that
> comparison. What are they again? Any help with this feature would be great.
>
> Matthew
>
> "Chris Neff" <[email protected]> wrote in message
> news:caauy0rvowwfgbcprcod6gt4goamozguexqslios3qhufhf0...@mail.gmail.com...
> On 6 October 2011 00:15, Matthew Dowle <[email protected]> wrote:
>> Indeed. Or columns 11 and 12 of BED files (genomics). Near on the agenda
>> is a fast file loader straight into data.table and list columns
>> (dual-delimited files such as BED).
>>
>
> Is this a fast file loader for any files that could be read using
> read.table, or just dual delimited files? If you can make a way to
> load things that is faster than read.table with the normal speed
> tweaks that get mentioned for it, I'd be ecstatic.
>
>> I don't believe SQL has an analogous concept to list columns? To achieve
>> that people may be using comma delimited strings in varchar columns, I
>> guess.
>>
>> On Wed, 2011-10-05 at 16:19 -0500, Branson Owen wrote:
>>> Thank you very, very much Matthew. I think this is a very valuable (at
>>> least to me), and unique feature for more powerful calculation. A very
>>> useful application I can immediately think of is for options chains
>>> and order book modeling. It's much easier to track and model the whole
>>> option chains or order book for each time stamp or symbol, and also
>>> save a lot of replicating time stamps and symbols.
>>>
>>> 2011/10/4 Matthew Dowle <[email protected]>
>>>     On Sun, 2011-10-02 at 15:14 +0800, Branson Owen wrote:
>>>
>>>     > Oh, sorry, I was testing the syntax like:
>>>     >
>>>     > DT = data.table(A = 1:2, B = list('a', 2i))
>>>     >
>>>     > It didn't work, and I though this feature has not been implemented.
>>>     > Thank you for pointing it out with a good example.
>>>
>>>     Natural to assume that should work. Now in 1.6.7 :
>>>
>>>     o  data.table() now accepts list columns directly rather than
>>>        needing to add list columns to an existing data.table; e.g.,
>>>
>>>          DT = data.table(x=1:3,y=list(4:6,3.14,matrix(1:12,3)))
>>>
>>>        Thanks to Branson Owen for reminding.
>>>
>>>        Accordingly, one item has been added to FAQ 2.17 (differences
>>>        between data.frame and data.table) :
>>>        "data.frame(list(1:2,"k",1:4)) creates 3 columns, data.table
>>>        creates one list column"
>>>
>>>     As before, list columns can be created via grouping; e.g.,
>>>
>>>          DT = data.table(x=c(1,1,2,2,2,3,3),y=1:7)
>>>          DT2 = DT[,list(list(unique(y))),by=x]
>>>          DT2
>>>               x V1
>>>          [1,] 1 1, 2
>>>          [2,] 2 3, 4, 5
>>>          [3,] 3 6, 7
>>>
>>>     and list columns can be grouped; e.g.,
>>>
>>>          DT2[,sum(unlist(V1)),by=list(x%%2)]
>>>               x V1
>>>          [1,] 1 16
>>>          [2,] 0 12
>>>

_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
