Filed as #2605.

About your ultimate goal... why would you want on-disk tables rather than RAM (apart from being able to read a file larger than RAM)? Wouldn't RAM always be quicker?

I think data.table::fread is priceless because it is way faster than any other read function. I just benchmarked fread reading a csv file against R loading its own .RData binary format, and shockingly fread is much faster! I think it is too bad R doesn't provide a very fast way of loading objects saved from a previous R session (well, why don't I do it myself if it is so easy...).
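For anyone curious, a minimal sketch of that kind of comparison (the table size and file names here are just for illustration, not my actual benchmark):

    library(data.table)
    DT <- as.data.table(matrix(rnorm(1e7), ncol = 10))  # ~10 million cells
    write.csv(DT, "test.csv", row.names = FALSE)
    save(DT, file = "test.RData")
    system.time(fread("test.csv"))   # csv parsed by fread
    system.time(load("test.RData"))  # R's own binary format

(Part of the gap is that save() compresses by default, so load() has to decompress as well; save(..., compress = FALSE) would make the .RData side of the comparison fairer.)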
2013/3/11 stat quant <[email protected]>

> On my way to fill it in.
>
> About your ultimate goal... why would you want on-disk tables rather than
> RAM (apart from being able to read a file larger than RAM)? Wouldn't RAM
> always be quicker?
>
> I think data.table::fread is priceless because it is way faster than any
> other read function.
> I just benchmarked fread reading a csv file against R loading its own
> .RData binary format, and shockingly fread is much faster!
> I think it is too bad R doesn't provide a very fast way of loading objects
> saved from a previous R session (well, why don't I do it myself if it is
> so easy...).
>
>
> 2013/3/11 Matthew Dowle <[email protected]>
>
>> Good idea statquant, please file it then. How about something more
>> general, e.g.
>>
>> fread(input, chunk.nrows=10000, chunk.filter = <anything acceptable
>> to i of DT[i]>)
>>
>> That <anything> could be grep() or any expression of column names. It
>> wouldn't be efficient to call it for every row one by one, and it
>> similarly couldn't be called on the whole DT, since the point is that DT
>> is greater than RAM. So some batch size needs to be defined, hence
>> chunk.nrows=10000. The filter would then be applied to each chunk, and
>> any rows passing it would make it into the final table.
>>
>> read.ffdf has something like this, I believe, and Jens already suggested
>> it when I ran the timings in example(fread) past him. We should probably
>> follow his lead on that in terms of argument names etc.
>>
>> Perhaps chunk should be defined in terms of RAM, e.g. chunk=100MB, since
>> that is how it needs to be internally, in terms of the number of pages
>> to map. Or maybe both, so that either nrows or MB would be acceptable.
>>
>> Ultimately (maybe in 5 years!) we're heading towards fread reading into
>> on-disk tables rather than RAM. Filtering in chunks will always be a
>> good option to have even then, though, as you might want to filter what
>> makes it into the on-disk table.
>>
>> Matthew
>>
>>
>> On 11.03.2013 12:53, MICHELE DE MEO wrote:
>>
>> Very interesting request. I also would be interested in this
>> possibility.
>> Cheers
>>
>>
>> 2013/3/11 stat quant <[email protected]>
>>
>>> Hello list,
>>> We like fread because it is very fast, yet sometimes files are huge
>>> and R cannot handle that much data. Some packages handle this
>>> limitation, but they do not provide a function similar to fread.
>>> Yet sometimes only subsets of a file are really needed, subsets that
>>> could fit into RAM.
>>>
>>> So what about adding a grep option to fread that would allow loading
>>> only the lines that match a regular expression?
>>>
>>> I'll add a request if you think the idea is worth implementing.
>>>
>>> Cheers
>>>
>>>
>>> _______________________________________________
>>> datatable-help mailing list
>>> [email protected]
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help
>>
>>
>> --
>> ***************************************************************
>> Michele De Meo, Ph.D
>> Statistical and data mining solutions
>> http://micheledemeo.blogspot.com/
>> skype: demeo.michele
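Regarding the chunk.filter idea quoted above: until something like it exists inside fread, it can be emulated (inefficiently) from R by calling fread repeatedly with skip/nrows. A rough sketch, where read_filtered is a made-up helper (not part of data.table) and the filter is a quoted expression on column names:

    library(data.table)

    read_filtered <- function(file, chunk.nrows = 10000L, filter) {
      hdr <- names(fread(file, nrows = 1L))   # peek at the header
      skip <- 1L                              # data starts after the header line
      out <- list()
      repeat {
        chunk <- tryCatch(
          fread(file, skip = skip, nrows = chunk.nrows, header = FALSE),
          error = function(e) NULL)           # skip ran past the end of the file
        if (is.null(chunk) || nrow(chunk) == 0L) break
        setnames(chunk, hdr)
        out[[length(out) + 1L]] <- chunk[eval(filter)]  # keep rows passing the filter
        skip <- skip + nrow(chunk)
        if (nrow(chunk) < chunk.nrows) break  # short chunk means end of file
      }
      rbindlist(out)
    }

    ## e.g. keep only rows where some column x is positive (names illustrative):
    ## res <- read_filtered("big.csv", chunk.nrows = 1e5L, filter = quote(x > 0))

Each fread call has to re-scan the skipped lines from the start of the file, so this is roughly quadratic in file size; it only shows the intended semantics. A native chunk.filter would keep its place in the file between chunks, which is the point of building it into fread.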
