Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

Antonio Piccolboni Tue, 01 May 2012 14:02:36 -0700

On Tue, May 1, 2012 at 11:29 AM, Simon Urbanek
<simon.urba...@r-project.org> wrote:


>
> On May 1, 2012, at 1:26 PM, Antonio Piccolboni <anto...@piccolboni.info>
> wrote:
>
> > It seems like people need to hear more context, happy to provide it. I am
> > implementing a serialization format (typedbytes, HADOOP-1722 if people want
> > the gory details) to make R and Hadoop interoperate better (RHadoop
> > project, package rmr). It is a row-first format and it's already
> > implemented as a C extension for R for lists and atomic vectors, where each
> > element of a vector is a row. I need to extend it to accept data frames
> > and I was wondering if I can use the existing C code by converting a data
> > frame to a list of its rows. It sounds like the answer is that it is not a
> > good idea,
>
> Just think about it -- data frames are lists of *columns* because the type
> of each column is fixed. Treating them row-wise is extremely inefficient,
> because you can't use any vector type to represent such a thing (other than
> a generic vector containing vectors of length 1).
>

Thanks, let's say this together with the experiments and other converging
opinions lays the question to rest.
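
To make the point about column storage concrete, here is a minimal sketch in
R. The data frame `df` and its values are made up for illustration and are not
taken from the thread. It shows that a data frame is stored as a list of typed
columns, and that a single row can only be held without coercion in a generic
vector (a list).

```r
# Hypothetical three-row data frame, used only for illustration.
df <- data.frame(x = 1:3,
                 y = c(0.5, 1.5, 2.5),
                 id = c("x1", "x2", "x3"),
                 stringsAsFactors = FALSE)

# A data frame is a list with one element per column.
storage <- typeof(df)

# Each column is an atomic vector with its own fixed type.
col_types <- vapply(df, typeof, character(1))

# A row mixes types, so it needs a generic vector (a list)
# whose elements are vectors of length 1.
row1 <- lapply(df, `[[`, 1)

# Forcing the row into an atomic vector coerces everything to character.
row1_atomic <- unlist(row1)
```

Inspecting `col_types` and `row1_atomic` shows what is lost when a row is
squeezed into a single atomic type.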


> > that's helpful too in a way because it restricts the options. I
> > thought I may be missing a simple primitive, like a t() for data frames
> > (that doesn't coerce to matrix).
>
> See above - I think you are misunderstanding data frames - t() makes no
> sense for data frames.
>

I think you are misunderstanding my use of t(). Thanks


Antonio
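
For reference, one way the "list of rows" conversion discussed above could be
sketched in base R without creating one-row data frames is shown here. This is
an illustration under stated assumptions, not the approach taken in rmr: the
helper name `rows_as_lists` is hypothetical, and no timing claims are made for
it.

```r
# Hypothetical helper: turn a data frame into a list of rows, where each
# row is a plain list of length-1 vectors rather than a 1-row data frame.
rows_as_lists <- function(df) {
  # Map() walks all columns in parallel and calls list() once per row.
  do.call(Map, c(list(f = list), as.list(df)))
}

# Made-up example data, not from the thread.
df <- data.frame(x = 1:3,
                 y = c(0.5, 1.5, 2.5),
                 id = paste0("x", 1:3),
                 stringsAsFactors = FALSE)

rows <- rows_as_lists(df)
second <- rows[[2]]
```

Each element of `rows` keeps the original column types, which is the property
a t() that "doesn't coerce to matrix" would need. The per-row allocation cost
described earlier in the thread still applies.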


>
> Cheers,
> Simon
>
>
>
> > On Tue, May 1, 2012 at 5:46 AM, Prof Brian Ripley <rip...@stats.ox.ac.uk>
> > wrote:
> >
> >> On 01/05/2012 00:28, Antonio Piccolboni wrote:
> >>
> >>> Hi,
> >>> I was wondering if there is anything more efficient than split to do the
> >>> kind of conversion in the subject. If I create a data frame as in
> >>>
> >>> system.time({fd = data.frame(x = 1:2000, y = rnorm(2000), id = paste("x",
> >>> 1:2000, sep = ""))})
> >>>   user  system elapsed
> >>>  0.004   0.000   0.004
> >>>
> >>> and then I try to split it
> >>>
> >>> system.time(split(fd, 1:nrow(fd)))
> >>>   user  system elapsed
> >>>  0.333   0.031   0.415
> >>>
> >>>
> >>> You will be quick to notice the roughly two orders of magnitude difference
> >>> in time between creation and conversion. Granted, it's not written
> >>> anywhere
> >>>
> >>
> >> Unsurprising when you create three orders of magnitude more data frames,
> >> is it?  That's a list of 2000 data frames.  Try
> >>
> >> system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id =
> >> paste0("x", i)))
> >>
> >>
> >>
> >>> that they should be similar but the latter seems interpreter-slow to me
> >>> (split is implemented with a lapply in the data frame case). There is also
> >>> a memory issue when I hit about 20000 elements (allocating 3GB when
> >>> interrupted). So before I resort to Rcpp, despite the electrifying feeling
> >>> of approaching the bare metal and for the sake of getting things done, I
> >>> thought I would ask the experts. Thanks
> >>>
> >>
> >> You need to re-think your data structures: 1-row data frames are not
> >> sensible.
> >>
> >>
> >>
> >>>
> >>> Antonio
> >>>
> >>>       [[alternative HTML version deleted]]
> >>>
> >>>
> >>> ______________________________________________
> >>> R-devel@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>
> >>
> >>
> >> --
> >> Brian D. Ripley,                  rip...@stats.ox.ac.uk
> >> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> >> University of Oxford,             Tel:  +44 1865 272861 (self)
> >> 1 South Parks Road,                     +44 1865 272866 (PA)
> >> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> >>
> >
> >
>
>
