Re: [Rd] fast version of split.data.frame or conversion from data.frame to list of its rows

Antonio Piccolboni Tue, 01 May 2012 10:30:23 -0700

It seems people would like more context, and I am happy to provide it. I am
implementing a serialization format (typedbytes, HADOOP-1722 if people want
the gory details) to make R and Hadoop interoperate better (RHadoop
project, package rmr). It is a row-first format, and it is already
implemented as a C extension for R for lists and atomic vectors, where each
element of a vector is a row. I need to extend it to accept data frames,
and I was wondering whether I could reuse the existing C code by converting
a data frame to a list of its rows. It sounds like the answer is that this
is not a good idea, which is helpful too, in a way, because it narrows the
options. I thought I might be missing a simple primitive, like a t() for
data frames (one that doesn't coerce to matrix). Thanks
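
For concreteness, here is a minimal sketch of the conversion in question,
assuming that each row may be represented as a plain named list rather than
as a 1-row data frame. The helper name df_to_rows is hypothetical, and the
sketch makes no claim about what the existing C code expects.

```r
# Hypothetical helper: convert a data frame to a list of its rows,
# where each row is a plain named list (not a 1-row data frame).
df_to_rows <- function(df) {
  lapply(seq_len(nrow(df)), function(i) {
    # Take the i-th element of every column; column names are kept.
    lapply(df, `[[`, i)
  })
}

# Small example; stringsAsFactors = FALSE keeps id as character.
fd <- data.frame(x = 1:3, y = c(0.5, 1.5, 2.5),
                 id = paste0("x", 1:3), stringsAsFactors = FALSE)
rows <- df_to_rows(fd)
str(rows[[2]])
```

Because no intermediate data frames are constructed, this avoids the
per-row data.frame overhead of split(), at the cost of one R-level function
call per cell.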


Antonio

On Tue, May 1, 2012 at 5:46 AM, Prof Brian Ripley <rip...@stats.ox.ac.uk> wrote:

> On 01/05/2012 00:28, Antonio Piccolboni wrote:
>
>> Hi,
>> I was wondering if there is anything more efficient than split to do the
>> kind of conversion in the subject. If I create a data frame as in
>>
>> system.time({fd =  data.frame(x=1:2000, y = rnorm(2000), id = paste("x",
>> 1:2000, sep =""))})
>>   user  system elapsed
>>   0.004   0.000   0.004
>>
>> and then I try to split it
>>
>> system.time(split(fd, 1:nrow(fd)))
>>    user  system elapsed
>>   0.333   0.031   0.415
>>
>>
>> You will be quick to notice the roughly two orders of magnitude difference
>> in time between creation and conversion. Granted, it's not written anywhere
>>
>
> Unsurprising when you create three orders of magnitude more data frames,
> is it?  That's a list of 2000 data frames.  Try
>
> system.time(for(i in 1:2000) data.frame(x = i, y = rnorm(1), id =
> paste0("x", i)))
>
>
>
>> that they should be similar, but the latter seems interpreter-slow to me
>> (split is implemented with lapply in the data frame case). There is also
>> a memory issue when I hit about 20000 elements (allocating 3GB when
>> interrupted). So before I resort to Rcpp, despite the electrifying feeling
>> of approaching the bare metal and for the sake of getting things done, I
>> thought I would ask the experts. Thanks
>>
>
> You need to re-think your data structures: 1-row data frames are not
> sensible.
>
>
>
>>
>> Antonio
>>
>>        [[alternative HTML version deleted]]
>>
>>
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>
> --
> Brian D. Ripley,                  rip...@stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>

        [[alternative HTML version deleted]]

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
