Re: [Rpy] Making dataframes ... fast

Nathaniel Smith Tue, 29 Sep 2009 13:14:04 -0700

On Tue, Sep 29, 2009 at 4:21 AM, Gary Strangman
<str...@nmr.mgh.harvard.edu> wrote:
> Without benchmarking, that seems mighty inefficient. Nathaniel Smith's
> rnumpy mostly allows the following:
>
> df = rnumpy.r.data_frame(numpy.array(d,np.object))
>
> ... which is 2 conversions (rather than 4), but I haven't been able to get
> the column names attached in this case. (My inexperience, I'm sure.)


Something like
  d_array = np.array(d, dtype=object)
  named_columns = dict([(name, d_array[:, i])
                        for i, name in enumerate(colnames)])
  df = rnumpy.r.data_frame(**named_columns)
should work, and doesn't add any extra data copies compared to what
you had (because numpy slicing is cheap).
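
A minimal sketch of why the column slices are cheap, assuming made-up
sample rows and column names (d and colnames below are hypothetical,
and the rnumpy call is left as a comment so the snippet runs with
numpy alone):

```python
import numpy as np

# Hypothetical sample rows; the real data in the thread is not shown.
d = [(1, 2.5, "a"), (2, float("nan"), "b"), (3, 4.0, "c")]
colnames = ["id", "value", "label"]  # assumed column names

# The one real copy: a 2-D object array of shape (n_rows, n_cols).
d_array = np.array(d, dtype=object)

# Each column slice is a view onto d_array, not a new copy of the data.
named_columns = {name: d_array[:, i] for i, name in enumerate(colnames)}

# True if every column still points into d_array's memory.
shares = all(np.shares_memory(col, d_array)
             for col in named_columns.values())

# With rnumpy installed, the dict would then be passed on as:
#   df = rnumpy.r.data_frame(**named_columns)
```

The slices only carry a different offset and stride into the same
buffer, which is why building the dict adds no per-element work.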

Fundamentally there isn't that much you can do to make this
crazy-fast, though. You have a jumble of individual Python objects
laid out row-wise in memory when R wants columns, and straightening
that out is going to take some Python-land fiddling. You have to
transpose the data somehow -- I don't know whether a Python loop or
copying the data into a np.array is faster. Then you have to detect
the Python object types and convert them into R. rnumpy uses numpy's
type detection code; this code is in C and quite fast, but it also has
to be conservative and check the entire column to make sure that
everything is of the same type, and it copies the data into a np.array
again before calling rinterface.*SexpVector -- you might be able to
beat it since you know your data is uniform and can construct the
SexpVector directly from the list/object array.
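
The trade-off above can be sketched in plain Python. This is only an
illustration, not rnumpy's or numpy's actual detection code: the
sample rows and both helper functions are hypothetical. It transposes
rows with zip and contrasts a conservative full-column type check with
a shortcut that trusts the data to be uniform.

```python
# Hypothetical sample rows with uniform column types.
rows = [(1, 2.5, "a"), (2, 3.5, "b"), (3, 4.0, "c")]

# Transpose row tuples into per-column tuples with a Python-level zip.
columns = list(zip(*rows))


def detect_type_full_scan(column):
    """Conservative: look at every element and reject mixed columns."""
    types = {type(value) for value in column}
    if len(types) != 1:
        names = sorted(t.__name__ for t in types)
        raise TypeError("mixed types in column: %s" % ", ".join(names))
    return types.pop()


def detect_type_trusting(column):
    """Shortcut: assume the caller knows the column is uniform."""
    return type(column[0])


full = [detect_type_full_scan(col) for col in columns]
fast = [detect_type_trusting(col) for col in columns]
```

The shortcut does constant work per column instead of touching every
element, but it silently gives the wrong answer if a column turns out
to be mixed, so it is only safe when uniformity is guaranteed upstream.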

Also, if you need proper NA handling, you need another pass to handle
that, and making that pass work may defeat all your optimization
tricks. How exactly this should work depends on how you've coded NA's
in your data. (I see from your example that you use np.nan for a float
column, but what about other data types?) Note that np.nan is not the
same as R's NA. (However, you can pull a float representing NA out of
R and use it when building your data structures in the first place --
in rnumpy it's rnumpy.NA_NUMERIC[0]. It will look like a NaN to
Python, but convert properly.)
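
To illustrate the "looks like a NaN to Python" point without needing R
installed, here is a sketch that builds such a value by hand. It
assumes, from my understanding of R's internals rather than anything
stated in this message, that R's numeric NA is a NaN whose low 32 bits
hold the value 1954. The constant and helpers are hypothetical; in
practice you would take the value from rnumpy.NA_NUMERIC[0] as
described above.

```python
import math
import struct

# Assumed bit pattern for R's numeric NA: NaN exponent, low word 1954.
R_NA_BITS = 0x7FF00000000007A2

# Reinterpret the 64-bit pattern as a Python float.
r_na = struct.unpack("<d", struct.pack("<Q", R_NA_BITS))[0]
plain_nan = float("nan")


def low_word(x):
    """Return the low 32 bits of a float's IEEE 754 representation."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & 0xFFFFFFFF


def looks_like_r_na(x):
    """True for a NaN carrying the assumed R NA payload."""
    return math.isnan(x) and low_word(x) == 1954
```

Both r_na and plain_nan satisfy math.isnan, so ordinary Python code
cannot tell them apart; only the payload bits differ, which is what
lets one convert to NA and the other to NaN on the R side.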

I'd suggest writing something that works, getting the analysis
pipeline running, and then profiling to see if this part matters...

-- Nathaniel
