On May 1, 2012, at 1:26 PM, Antonio Piccolboni <anto...@piccolboni.info> wrote:
> It seems like people need to hear more context, happy to provide it. I am
> implementing a serialization format (typedbytes, HADOOP-1722 if people want
> the gory details) to make R and Hadoop interoperate better (RHadoop
> project, package rmr). It is a row first format and it's already
> implemented as a C extension for R for lists and atomic vectors, where each
> element of a vector is a row. I need to extend it to accept data frames
> and I was wondering if I can use the existing C code by converting a data
> frame to a list of its rows. It sounds like the answer is that it is not a
> good idea,

Just think about it -- data frames are lists of *columns* because the type
of each column is fixed. Treating them row-wise is extremely inefficient,
because you can't use any vector type to represent such a thing (other than
a generic vector containing vectors of length 1).

> that's helpful too in a way because it restricts the options. I
> thought I may be missing a simple primitive, like a t() for data frames
> (that doesn't coerce to matrix).

See above - I think you are misunderstanding data frames - t() makes no
sense for data frames.

Cheers,
Simon

> On Tue, May 1, 2012 at 5:46 AM, Prof Brian Ripley
> <rip...@stats.ox.ac.uk> wrote:
>
>> On 01/05/2012 00:28, Antonio Piccolboni wrote:
>>
>>> Hi,
>>> I was wondering if there is anything more efficient than split to do the
>>> kind of conversion in the subject. If I create a data frame as in
>>>
>>> system.time({fd = data.frame(x=1:2000, y = rnorm(2000), id = paste("x",
>>> 1:2000, sep =""))})
>>>    user  system elapsed
>>>   0.004   0.000   0.004
>>>
>>> and then I try to split it
>>>
>>> system.time(split(fd, 1:nrow(fd)))
>>>    user  system elapsed
>>>   0.333   0.031   0.415
>>>
>>> You will be quick to notice the roughly two orders of magnitude difference
>>> in time between creation and conversion.
>>> Granted, it's not written anywhere
>>
>> Unsurprising when you create three orders of magnitude more data frames,
>> is it? That's a list of 2000 data frames. Try
>>
>> system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id =
>> paste0("x", i)))
>>
>>> that they should be similar but the latter seems interpreter-slow to me
>>> (split is implemented with a lapply in the data frame case) There is also
>>> a memory issue when I hit about 20000 elements (allocating 3GB when
>>> interrupted). So before I resort to Rcpp, despite the electrifying feeling
>>> of approaching the bare metal and for the sake of getting things done, I
>>> thought I would ask the experts. Thanks
>>
>> You need to re-think your data structures: 1-row data frames are not
>> sensible.
>>
>>> Antonio
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>> --
>> Brian D. Ripley,                  rip...@stats.ox.ac.uk
>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>> University of Oxford,             Tel:  +44 1865 272861 (self)
>> 1 South Parks Road,                     +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
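
To illustrate the point in the reply that data frames are lists of columns, here is a minimal sketch that is not taken from the thread. It rebuilds the `fd` data frame from the original post, inspects its column structure, and shows one possible way to get a list of rows as plain lists rather than as 2000 one-row data frames. The names `row1` and `rows` are hypothetical, `stringsAsFactors = FALSE` is an assumption added so that `id` stays a character column on any R version, and the `do.call(Map, ...)` idiom is only one of several options, not something the posters recommended.

```r
# Illustrative sketch only; `fd` mirrors the data frame from the original post.
# stringsAsFactors = FALSE is an assumption so that `id` stays character.
set.seed(1)
fd <- data.frame(x = 1:2000, y = rnorm(2000),
                 id = paste("x", 1:2000, sep = ""),
                 stringsAsFactors = FALSE)

# A data frame is a list of columns, each an atomic vector of one type.
is.list(fd)          # TRUE
length(fd)           # 3 (columns), not 2000 (rows)
sapply(fd, typeof)   # x: "integer", y: "double", id: "character"

# A single row mixes types, so it can only be held in a generic vector
# (a list) whose elements are vectors of length 1.
row1 <- lapply(fd, `[[`, 1)

# Hypothetical row-wise conversion: walk the columns in parallel and build
# one plain list per row, avoiding the cost of 2000 one-row data frames.
rows <- do.call(Map, c(f = list, fd))
length(rows)         # 2000
```

Each element of `rows` is a list with elements `x`, `y` and `id`, for example `rows[[1]]$id` is `"x1"`. Whether this is fast enough for serialization would need to be timed with `system.time()`; writing the format column by column may still be the better fit for the data structure.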