Gary Strangman wrote:
> 
> Hi Laurent,
> 
> The only way to reduce the number of transformations is to add an 
> equivalent number of columns to the dataframe (so that instead of 
> several hundred thousand conversions, I need several hundred thousand 
> columns), and then pass this beast back and forth between Python and 
> R for regression fits. I figured that was completely untenable.
> 
> Could you give an example of how to convert a variable such as d (below) 
> to a dataframe, with column names in colnames (below), using the 
> rinterface? I'm capable with R and Python, but I'm thoroughly confused 
> about what the best order is for conversion. For example, I recall that (at 
> one point anyway) converting a list of lists like mine required 
> something like:
> 
> 1) d_transpose = zip(*d), to change the list-of-lists from row-wise to
>    column-wise
> 
> 2) apply array.array() to each tuple in d_transpose (for which you
>    need to know the data types in advance)
> 
> 3) loop to make a dictionary out of the colnames plus the list of
>    array.array()s
> 
> 4) pass the dictionary to RDataFrame()
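
For reference, here is a rough sketch of those four steps (a sketch
only: 'RDataFrame' stands for the rpy-classic constructor you mention,
the typecodes have to be known in advance, and the string columns stay
plain lists since array.array has no string type):

import array

typecodes = [None, None, 'd', 'i', 'i']   # None marks a string column
d_transpose = zip(*d)                     # step 1: row-wise -> column-wise
columns = [list(t) if tc is None else array.array(tc, t)
           for tc, t in zip(typecodes, d_transpose)]   # step 2
dr = dict(zip(colnames, columns))         # step 3
dataf = RDataFrame(dr)                    # step 4

Here is how I would do it with rpy2.rinterface directly: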



import collections
import rpy2.robjects as ro
import rpy2.rinterface as ri

ColInfo = collections.namedtuple('ColInfo', 'name, rtype')
# one entry per column of 'd' (types known in advance)
cols = [ColInfo('code', ri.StrSexpVector),
        ColInfo('pop', ri.StrSexpVector),
        ColInfo('score', ri.FloatSexpVector),
        ColInfo('boolflag', ri.IntSexpVector),
        ColInfo('counts', ri.IntSexpVector)]
#note: hopefully numpy.nan and R's NA_real_ map to the same value

#the following reshaping is required to "transpose" the list of lists
dr = {}
for col_i, col in enumerate(cols):
    dr[col.name] = col.rtype([x[col_i] for x in d])

#note: the column order is lost below (see the ordered alternative
#after this snippet if this matters)
dataf = ri.baseenv['data.frame'](**dr)
#with rpy2-2.0.x
# dataf = ri.baseEnv['data.frame'](**dr)

# turn it into an robjects.DataFrame (*if needed*)
dataf = ro.conversion.ri2py(dataf)
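
If keeping the column order matters, here is a sketch using an ordered
container instead of a dict (this assumes rpy2.rlike.container, whose
OrdDict is accepted by robjects.DataFrame in recent rpy2 versions):

import rpy2.rlike.container as rlc

od = rlc.OrdDict()
for col in cols:
    od[col.name] = dr[col.name]    # insertion order is preserved
dataf_ordered = ro.DataFrame(od)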


> Without benchmarking, that seems mighty inefficient. Nathaniel Smith's 
> rnumpy mostly allows the following:
> 
> df = rnumpy.r.data_frame(numpy.array(d, np.object))
> 
> ... which is 2 conversions (rather than 4), but I haven't been able to 
> get the column names attached in this case. (My inexperience, I'm sure.)

Well, well... rnumpy is surely very nice, but it cannot do miracles.
There are necessarily conversions happening within those function calls
(your step 1 above, at the very least).
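
That said, if attaching the column names is the only blocker, they can
be set after the fact from the rpy2 side (a one-liner sketch, assuming
a data frame built as in the snippet above and R's names<- reached
through rinterface):

dataf = ri.baseenv['names<-'](dataf, ri.StrSexpVector(colnames))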

> I'm hoping to make this general-purpose, so I may not know in advance 
> whether column 1 will be a StrSexpVector or an IntSexpVector or 
> whatever.

Knowing the column types in advance will speed things up (see the
snippet above). If not, you will have to use a "discover what type"
approach, as sketched below.
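
A minimal sketch of such type discovery, assuming the first row is
representative of each column (a column whose first value is numpy.nan
maps to float, which is what you want here):

py2rtype = {str: ri.StrSexpVector,
            float: ri.FloatSexpVector,
            int: ri.IntSexpVector,
            bool: ri.BoolSexpVector}
cols = [ColInfo(name, py2rtype[type(value)])
        for name, value in zip(colnames, d[0])]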

> Any code samples (I'm happy to run benchmarks on a set of 
> options and provide them for your site) using the trivial data structure 
> immediately below would /really/ help.
> 
> d = [['S80', 'C', 137.5, 0, 1],
>      ['S82', 'C', 155.1, 1, 3],
>      ['S83', 'T', 11.96, 0, 5],
>      ['S84', 'T', 47,    1, 6],
>      ['S85', 'T', numpy.nan, 1, 31]]
> colnames = ['code','pop','score','boolflag','counts']
> 
> -best
> Gary
> 
> On Tue, 29 Sep 2009, Laurent Gautier wrote:
> 
>> Gary,
>>
>> Two things come to my mind:
>>
>> - Try having an initial Python data structure that requires fewer 
>> transformations than your current one in order to become a DataFrame.
>>
>> - Use rpy2.rinterface when speed matters. This can already be faster 
>> than R itself.
>> http://rpy.sourceforge.net/rpy2/doc-dev/html/performances.html
>>
>>
>> L.
>>
>>
>>
>>
>> Gary Strangman wrote:
>>> Hi all,
>>>
>>> I have a python list of lists (each sublist is a row of data), plus a 
>>> list of column names. Something like this ...
>>>
>>>>>> d = [['S80', 'C', 137.5, 0],
>>>           ['S82', 'C', 155.1, 1],
>>>           ['S83', 'T', 11.96, 0],
>>>           ['S84', 'T', 47,    1],
>>>           ['S85', 'T', numpy.nan, 1]]
>>>>>> colnames = ['code','pop','score','flag']
>>>
>>> I'm looking for the /fastest/ way to create an R dataframe (via rpy2) 
>>> using these two variables. It could be via dictionaries, numpy object 
>>> arrays, whatever, it just needs to be fast. Note that the data has 
>>> mixed types (some columns are strings, some are floats, some are 
>>> ints), and there are missing values which I'd like R to interpret as 
>>> NA. I can pre-transform the elements of the d variable as required to 
>>> facilitate this.
>>>
>>> I need to do this step several hundred thousand times (yes, different 
>>> data each time) on up to ~10,000 element datasets, so any speedup 
>>> suggestions are welcome.
>>>
>>> -best
>>> Gary
>>>
>>>
>>
>>
>>

