On Tue, Sep 29, 2009 at 4:21 AM, Gary Strangman <str...@nmr.mgh.harvard.edu> wrote:
> Without benchmarking, that seems mighty inefficient. Nathaniel Smith's
> rnumpy mostly allows the following:
>
> df = rnumpy.r.data_frame(numpy.array(d,np.object))
>
> ... which is 2 conversions (rather than 4), but I haven't been able to get
> the column names attached in this case. (My inexperience, I'm sure.)

Something like

    d_array = np.array(d, dtype=object)
    named_columns = dict([(name, d_array[:, i])
                          for i, name in enumerate(colnames)])
    df = rnumpy.r.data_frame(**named_columns)

should work, and doesn't add any extra data copies compared to what you had (because numpy slicing is cheap).

Fundamentally there isn't that much you can do to make this crazy-fast, though. You have a jumble of individual Python objects laid out in the wrong way in memory, and it's going to take some Python-land fiddling around to get that straightened out. You have to transpose the data somehow -- I don't know whether a Python loop or copying the data into a np.array is faster. Then you have to detect the Python object types and convert them into R. rnumpy uses numpy's type detection code; this code is in C and quite fast, but it also has to be conservative and check the entire column to make sure that everything is of the same type, and it copies the data into a np.array again before calling rinterface.*SexpVector -- you might be able to beat it since you know your data is uniform and can construct the SexpVector directly from the list/object array.

Also, if you need proper NA handling, you need another pass to handle that, and making that pass work may defeat all your optimization tricks. How exactly this should work depends on how you've coded NA's in your data. (I see from your example that you use np.nan for a float column, but what about other data types?) Note that np.nan is not the same as R's NA. (However, you can pull a float representing NA out of R and use it when building your data structures in the first place -- in rnumpy it's rnumpy.NA_NUMERIC[0]. It will look like a NaN to Python, but convert properly.)

I'd suggest writing something that works, getting the analysis pipeline running, and then profiling to see if this part matters...

-- Nathaniel
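
A minimal sketch of the transposition step described above, using only numpy so that it runs without R installed. The sample rows in d and the names in colnames are made up for illustration, and the rnumpy call is left as a comment because it needs a working R/rnumpy setup. The last part lets numpy infer a dtype for each column as a rough stand-in for the type-detection pass; it is not necessarily what rnumpy does internally.

```python
import numpy as np

# Hypothetical sample data: a list of rows holding mixed Python objects.
d = [
    (1, "a", 2.5),
    (2, "b", float("nan")),
    (3, "c", 4.0),
]
colnames = ["id", "label", "value"]  # assumed column names

# One copy: the rows become a 2-D array of Python object references.
d_array = np.array(d, dtype=object)

# Slicing out a column gives a view onto d_array, not another copy.
named_columns = {name: d_array[:, i] for i, name in enumerate(colnames)}
all_views = all(np.shares_memory(d_array, col)
                for col in named_columns.values())

# Stand-in for type detection: let numpy scan each column and pick a
# uniform dtype (this step does copy the data once more).
typed = {name: np.asarray(col.tolist())
         for name, col in named_columns.items()}

for name in colnames:
    print(name, typed[name].dtype.kind)

# With R and rnumpy available, the columns would then be passed on as in
# the message above:
# df = rnumpy.r.data_frame(**named_columns)
```

If the column types are already known, the inference step could be skipped by passing an explicit dtype per column, which is roughly the shortcut mentioned above for uniform data.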