(Belated continuation of a thread I started.)

Joel and Olivier, thanks for your comments. I had seen the docs at http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction but for some reason I thought NLTK provided significant functionality that sklearn didn’t. R’s tm package is very useful for transforming text into term matrices, and it didn’t look like sklearn had anything similar.
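
As a rough illustration of the tm-style workflow using the text feature extraction tools in those docs, here is a minimal sketch. The toy documents and variable names are made up for illustration, and it assumes scikit-learn and pandas are installed; it is one possible way to get a document-term matrix into a DataFrame, not a recommended recipe.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Tokenize the documents and count term occurrences.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps each term to its column index, so sorting the terms
# by that index gives the column names in matrix order.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Converting to a dense array is fine for a toy corpus; a large corpus
# would normally be kept sparse.
dtm = pd.DataFrame(counts.toarray(), columns=terms)
print(dtm)
```

Each row is a document and each column a term, which is roughly the shape of a tm document-term matrix.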

> On Sun, Jul 7, 2013 at 6:58 AM, Joel Nothman <jnoth...@student.usyd.edu.au>
> wrote:
> I am not aware of a definitive, complete solution. Lars has built an
> NLTK-compatible classifier interface in nltk.classify.scikitlearn, while
> scikit-learn provides the various components in sklearn.feature_extraction
> that handle text directly, or would allow you to readily produce arrays from
> feature dicts.
>
> I don't think there's any clear, generic way for them to interface better:
> both systems prefer to interface with native types (dicts, numpy arrays)
> rather than sophisticated framework components.

Good point.

> I also don't know what data you want to analyse in Pandas: the feature data?
> the classification results?

The feature data. Pandas’ DataFrame class seems like a rich, general purpose structure for storing, manipulating and plotting data for ML. I’ve been trying to use it everywhere. I realize it may not be a good choice for text, though.

Olivier: thanks for the pointer to sklearn-pandas and the DataFrameMapper class. I’m always finding corners of packages I didn’t know existed. It may not be a good choice for text, but it’s useful for a lot of other representation tasks I have to do.

> (But I'm also not convinced that NLTK is the right tool for a lot of
> large-scale feature extraction jobs.)

I’m curious – why?

Thanks,
-Tom