(Belated continuation of a thread I started.)

Joel and Olivier, thanks for your comments.  I had seen the docs at 
http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction
but for some reason I thought NLTK provided significant functionality that 
sklearn didn’t.  R’s tm package is very useful for transforming text into 
term matrices, and it didn’t look like sklearn had anything similar.
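
For anyone finding this thread later, here is a minimal sketch of what the 
text feature extraction docs linked above cover: building a document-term 
matrix with sklearn's CountVectorizer, roughly the counterpart of a tm term 
matrix.  The two toy documents are made up for illustration, and the default 
tokenizer settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# With default settings, tokens are lowercased words of 2+ characters.
vectorizer = CountVectorizer()

# Sparse matrix of raw term counts: one row per document, one column per term.
dtm = vectorizer.fit_transform(docs)

# vocabulary_ maps each term to its column index; sort terms by that index.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(terms)
print(dtm.toarray())
```

The result is a scipy sparse matrix rather than a dedicated corpus object, so 
it can be passed straight to an estimator.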

> On Sun, Jul 7, 2013 at 6:58 AM, Joel Nothman <jnoth...@student.usyd.edu.au> 
> wrote:
> I am not aware of a definitive, complete solution. Lars has built an 
> NLTK-compatible classifier interface in nltk.classify.scikitlearn, while 
> scikit-learn provides the various components in sklearn.feature_extraction 
> that handle text directly, or would allow you to readily produce arrays from 
> feature dicts.
> 
> I don't think there's any clear, generic way for them to interface better: 
> both systems prefer to interface with native types (dicts, numpy arrays) 
> rather than sophisticated framework components. 

Good point.  

> I also don't know what data you want to analyse in Pandas: the feature data? 
> the classification results?

The feature data.  Pandas’ DataFrame class seems like a rich, general-purpose 
structure for storing, manipulating and plotting data for ML.  I’ve been trying 
to use it everywhere.  I realize it may not be a good choice for text, though.
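
A rough sketch of what putting text features into a DataFrame might look 
like, assuming a small corpus where densifying the count matrix is affordable. 
The toy documents are invented for illustration.  My guess at why this scales 
poorly for text is that toarray() discards the sparsity, which matters once 
the vocabulary gets large.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse term-count matrix

# Column labels: terms ordered by their column index in the matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Densify and wrap in a DataFrame: one row per document, one column per term.
# Only reasonable for small vocabularies, since every zero is stored.
df = pd.DataFrame(counts.toarray(), columns=terms)

print(df)
```

From here the usual DataFrame tools apply, e.g. df.sum() for corpus-wide term 
counts.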

Olivier: thanks for the pointer to sklearn-pandas and the DataFrameMapper 
class.  I’m always finding corners of packages I didn’t know existed.  It may 
not be a good choice for text, but it’s useful for a lot of other 
representation tasks I have to do.

> (But I'm also not convinced that NLTK is the right tool for a lot of 
> large-scale feature extraction jobs.)

I’m curious – why?

Thanks,
-Tom


_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
