Gary,

The "wrong" (transposed) order applies to the creation of a 
data.frame, which is distinct from reading the information needed to 
create a data.frame from a file in which each row is represented by a 
line.

In R, the functions read.table, read.csv, read.delim, etc. do the 
transposition job (if you have ever worked with very large files of 
that kind, especially ones with a very large number of columns, you 
will also have noticed that they are not fast in all situations). This 
is one of the many cases where R makes operations seem natural and 
obvious, while they are sometimes not computationally ideal - R is 
good for that... and R is bad for that.

The suggestions Nathaniel and I gave for transposing the data in 
Python and/or numpy should have comparable efficiency in most cases, 
and should be comparable to R's read.table.
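
For instance, the column-by-column route might look like this (a 
minimal sketch: the data and column names are hypothetical stand-ins, 
and the typed vector constructors assume the rpy2-2.1-dev API):

  import rpy2.robjects as robjects

  d = [(1.0, "a"), (2.0, "b")]  # stand-in for your list of rows

  # One R conversion per column; the list comprehensions do the
  # row-to-column transposition in Python.
  columns = {'x': robjects.FloatVector([row[0] for row in d]),
             'y': robjects.StrVector([row[1] for row in d])}
  df = robjects.r['data.frame'](**columns)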

For the NA case, you may also push more of the fetching of your data to 
R (you have not detailed how you are getting your data); this would save 
you a loop to check for NAs.
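
For example, a minimal sketch (the file name and read.table options 
are hypothetical; adapt them to your data):

  import rpy2.robjects as robjects

  # Let R do both the parsing and the NA handling in one step:
  # read.table turns "NA" fields into real R NAs as it reads.
  df = robjects.r('read.table("mydata.txt", header=TRUE, na.strings="NA")')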

In addition to that, and as Nathaniel mentioned, creating the 
data.frame might not be the most time-consuming step in your pipeline; 
parallel processing might be something you want to consider.


Laurent


PS: Regarding the different flavours of NA, rpy2-2.0.7 still has issues 
with the ones exposed from rpy2.rinterface, while these have long been 
fixed in rpy2-2.1-dev (what little time/resources I have goes into 
2.1-dev). You can nevertheless get them in a number of easy ways, for 
example:
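
  import rpy2.robjects as robjects

  # Fetch R's own NA values once and reuse them when building your
  # data (the Python names on the left are arbitrary):
  NA_real = robjects.r('NA_real_')[0]
  NA_integer = robjects.r('NA_integer_')[0]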






Gary Strangman wrote:
> Very helpful, thanks!
> 
> As for having data in the "wrong" order, it's a little odd that a datafile 
> that's perfect for loading into R as a dataframe (via read.table) is 
> inherently in the "wrong" order for dataframe creation after reading it 
> into Python (using numpy.genfromtxt(), or f.readlines(), or whatever).
> 
> As for NAs, those I can do when I set up my data prior to looping. So, 
> those shouldn't be a problem. Thanks for giving me a clue on what to use 
> in Python/rnumpy to get a proper NA conversion ... that's been far from 
> obvious.
> 
> I'll head off to do some tinkering and profiling now.
> 
> -best
> Gary
> 
> On Tue, 29 Sep 2009, Nathaniel Smith wrote:
> 
>> On Tue, Sep 29, 2009 at 4:21 AM, Gary Strangman
>> <str...@nmr.mgh.harvard.edu> wrote:
>>> Without benchmarking, that seems mighty inefficient. Nathaniel Smith's
>>> rnumpy mostly allows the following:
>>>
>>> df = rnumpy.r.data_frame(np.array(d, np.object))
>>>
>>> ... which is 2 conversions (rather than 4), but I haven't been able to get
>>> the column names attached in this case. (My inexperience, I'm sure.)
>> Something like
>>   d_array = np.array(d, dtype=object)
>>   named_columns = dict([(name, d_array[:, i])
>>                         for i, name in enumerate(colnames)])
>>   df = rnumpy.r.data_frame(**named_columns)
>> should work, and doesn't add any extra data copies compared to what
>> you had (because numpy slicing is cheap).
>>
>> Fundamentally there isn't that much you can do to make this
>> crazy-fast, though. You have a jumble of individual Python objects
>> laid out in the wrong way in memory, and that's going to take some
>> Python-land fiddling around to get that straightened out. You have to
>> transpose the data somehow -- I don't know whether a Python loop or
>> copying the data into a np.array is faster. Then you have to detect
>> the Python object types and convert them into R. rnumpy uses numpy's
>> type detection code; this code is in C and quite fast, but it also has
>> to be conservative and check the entire column to make sure that
>> everything is of the same type, and it copies the data into a np.array
>> again before calling rinterface.*SexpVector -- you might be able to
>> beat it since you know your data is uniform and can construct the
>> SexpVector directly from the list/object array.
>>
>> Also, if you need proper NA handling, you need another pass to handle
>> that, and making that pass work may defeat all your optimization
>> tricks. How exactly this should work depends on how you've coded NA's
>> in your data. (I see from your example that you use np.nan for a float
>> column, but what about other data types?) Note that np.nan is not the
>> same as R's NA. (However, you can pull a float representing NA out of
>> R and use it when building your data structures in the first place --
>> in rnumpy it's rnumpy.NA_NUMERIC[0]. It will look like a NaN to
>> Python, but convert properly.)
>>
>> I'd suggest writing something that works, getting the analysis
>> pipeline running, and then profiling to see if this part matters...
>>
>> -- Nathaniel
>>
> 


