Re: [Scikit-learn-general] Question about applying dimension reduction on text

Olivier Grisel Wed, 26 Oct 2011 06:44:09 -0700

2011/10/26 Gael Varoquaux <[email protected]>:
> On Wed, Oct 26, 2011 at 03:02:28PM +0200, SK Sn wrote:
>> Hi there, I am trying to apply and test several dimension reduction methods
>> on the 20 Newsgroups data. However, I get errors that I do not understand
>> with all of them except RandomizedPCA. Would you please help me get a
>> better understanding of the issue?
>
>> X = Vectorizer(max_features=10000).fit_transform(data_set.data)
>
> I think that your problem is that X (returned by the Vectorizer) is a
> sparse matrix, and that the different methods other than the
> RandomizedPCA do not accept sparse matrices as inputs.
>
> You can make the data dense using
> X = X.todense()
>
> This will consume much more memory, and might not be an option, though.


Indeed, densifying will blow up memory with the default vectorizer
configuration. You can pass max_features=1000 or 10000 to the Vectorizer
to restrict the extracted features to the most frequent tokens (you can
combine this parameter with max_df=0.95 to also drop tokens that occur
in more than 95% of the documents, which gets rid of most stop words)
and then call todense() on the output of the vectorizer to copy the
result into a dense numpy matrix (toarray() gives a plain numpy array).
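
A minimal sketch of that recipe, under some assumptions: it uses
TfidfVectorizer, the name under which current scikit-learn releases expose
a tf-idf vectorizer (the class was called Vectorizer at the time of this
thread), and a tiny made-up corpus instead of the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in corpus (an assumption; the thread uses 20 newsgroups).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang in the tree",
    "the dog barked at the bird",
]

# max_features keeps only the most frequent tokens; max_df=0.95 drops
# tokens that appear in more than 95% of the documents (here: "the").
vectorizer = TfidfVectorizer(max_features=1000, max_df=0.95)
X_sparse = vectorizer.fit_transform(docs)

# toarray() returns a plain numpy.ndarray; todense() returns a numpy.matrix.
X_dense = X_sparse.toarray()

print(X_sparse.shape, type(X_dense).__name__)
print("'the' kept in vocabulary:", "the" in vectorizer.vocabulary_)
```

On real data the dense copy needs n_samples * n_features * 8 bytes for
float64 values, so it is worth checking that product before densifying.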

Beware that most "dense-input" algorithms will probably be very slow
in high dimensions though. Right now RandomizedPCA is probably the
only decomposition method from the scikit-learn project that can scale
to both high n_samples and n_features (especially with sparse features
as with text data).
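
To illustrate reducing a sparse matrix without densifying it, here is a
sketch that swaps in a different estimator: TruncatedSVD with the
randomized solver, from later scikit-learn releases, where RandomizedPCA
is no longer available under that name. It is not the same model (it does
not center the data), and the random matrix is only a stand-in for a
documents-by-terms matrix.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for a documents-by-terms matrix
# (assumption: 200 documents, 5000 features, 1% non-zero entries).
X = sparse.random(200, 5000, density=0.01, format="csr", random_state=0)

# Randomized truncated SVD operates on the sparse input directly.
svd = TruncatedSVD(n_components=10, algorithm="randomized", random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)
```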

You can also try sklearn.decomposition.NMF which is able to work with
sparse matrices as input directly but will be slower than
RandomizedPCA. There is an example here:

  http://scikit-learn.org/dev/auto_examples/applications/topics_extraction_with_nmf.html#example-applications-topics-extraction-with-nmf-py
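
Independently of the linked example, a short sketch of NMF taking a
scipy.sparse matrix directly could look like this. The random non-negative
matrix and the parameter values are assumptions chosen only to keep the
snippet small.

```python
from scipy import sparse
from sklearn.decomposition import NMF

# Non-negative random sparse matrix standing in for tf-idf features.
X = sparse.random(100, 500, density=0.05, format="csr", random_state=0)

# NMF factorizes X into W (samples x components) and H (components x features).
nmf = NMF(n_components=5, init="nndsvd", random_state=0, max_iter=300)
W = nmf.fit_transform(X)  # X stays sparse; no todense() call is needed
H = nmf.components_

print(W.shape, H.shape)
```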

MiniBatchKMeans, although not strictly a decomposition method (it does
clustering, but the cluster centers can be treated as components), will
handle text data quite efficiently too (and huge speedups are to be
expected soon :). There is also an example here:

  http://scikit-learn.org/dev/auto_examples/document_clustering.html#example-document-clustering-py
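
As a rough sketch of the idea of treating cluster centers as components,
assuming a random sparse matrix in place of real text features and
arbitrary values for n_clusters and batch_size:

```python
from scipy import sparse
from sklearn.cluster import MiniBatchKMeans

# Random sparse matrix standing in for vectorized documents.
X = sparse.random(300, 1000, density=0.02, format="csr", random_state=0)

km = MiniBatchKMeans(n_clusters=4, batch_size=100, n_init=3, random_state=0)
labels = km.fit_predict(X)  # sparse input is accepted as-is

# Each row of cluster_centers_ lives in feature space, so it can be
# inspected like a component (for example, its largest-weight terms).
centers = km.cluster_centers_

# transform() gives distances to each center: a 4-dimensional representation.
X_reduced = km.transform(X)

print(centers.shape, X_reduced.shape)
```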

A note for the scikit-learn developers:

=> we should definitely improve the tooling for checking the input and
emit informative ValueError messages that state explicitly that
scipy.sparse matrices are not supported as input for the models
mentioned by the poster.
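
A sketch of what such a check could look like. The helper name
check_dense_input and the message wording are hypothetical, made up for
this illustration; this is not an existing scikit-learn function.

```python
import numpy as np
from scipy import sparse


def check_dense_input(X, estimator_name):
    """Hypothetical helper: reject scipy.sparse input with a clear message."""
    if sparse.issparse(X):
        raise ValueError(
            "%s does not support scipy.sparse matrices as input (got %s). "
            "Convert with X.toarray() if the dense data fits in memory."
            % (estimator_name, type(X).__name__)
        )
    # Dense input is passed through as a numpy array.
    return np.asarray(X)


try:
    check_dense_input(sparse.csr_matrix((2, 3)), "SomeEstimator")
except ValueError as exc:
    print(exc)
```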

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

