Thank you Olivier. I will have a look into NMF. :)
On 26 October 2011 15:43, Olivier Grisel <[email protected]> wrote:

> 2011/10/26 Gael Varoquaux <[email protected]>:
> > On Wed, Oct 26, 2011 at 03:02:28PM +0200, SK Sn wrote:
> >> Hi there, I am trying to apply and test several dimensionality
> >> reduction methods on the 20 Newsgroups data. However, I got errors
> >> that I did not understand on all of them except RandomizedPCA. Would
> >> you please help me get a better understanding of the issue?
> >
> >> X = Vectorizer(max_features=10000).fit_transform(data_set.data)
> >
> > I think that your problem is that X (returned by the Vectorizer) is a
> > sparse matrix, and that the methods other than RandomizedPCA do not
> > accept sparse matrices as input.
> >
> > You can make the data dense using
> > X = X.todense()
> >
> > This will consume much more memory, and might not be an option, though.
>
> Indeed, that will blow up with the default vectorizer configuration.
> You can pass max_features=1000 or 10000 to the Vectorizer to restrict
> the extracted features to the most frequent tokens (combine this with
> max_df=0.95 to get rid of the stop words as well) and then call
> todense() on the output of the vectorizer to copy the result into a
> dense numpy array.
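>
> For example, something along these lines (untested sketch; adjust
> max_features to whatever fits in memory):
>
>     from sklearn.feature_extraction.text import Vectorizer
>
>     # data_set.data is the list of raw 20 Newsgroups documents from the
>     # original snippet
>     X = Vectorizer(max_features=10000, max_df=0.95).fit_transform(data_set.data)
>     # X is a scipy.sparse matrix; copy it into a dense array for the
>     # dense-input models (feasible now that the vocabulary is capped)
>     X_dense = X.todense()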
>
> Beware that most "dense-input" algorithms will probably be very slow
> in high dimensions though. Right now RandomizedPCA is probably the
> only decomposition method from the scikit-learn project that can scale
> to both high n_samples and n_features (especially with sparse features
> as with text data).
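>
> For instance, something like this on the sparse matrix directly (rough
> sketch, untested):
>
>     from sklearn.decomposition import RandomizedPCA
>
>     # RandomizedPCA accepts the sparse output of the vectorizer as-is
>     rpca = RandomizedPCA(n_components=100)
>     X_reduced = rpca.fit_transform(X)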
>
> You can also try sklearn.decomposition.NMF, which can work with sparse
> matrices as input directly but will be slower than RandomizedPCA.
> There is an example here:
>
>
> http://scikit-learn.org/dev/auto_examples/applications/topics_extraction_with_nmf.html#example-applications-topics-extraction-with-nmf-py
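>
> Roughly (untested sketch, the parameters are just placeholders):
>
>     from sklearn.decomposition import NMF
>
>     # NMF works on the sparse term-frequency matrix directly, but is
>     # slower than RandomizedPCA
>     nmf = NMF(n_components=10)
>     W = nmf.fit_transform(X)   # per-document topic weights
>     H = nmf.components_        # per-topic term loadings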
>
> MiniBatchKMeans, albeit not strictly a decomposition method (it does
> clustering, but the cluster centers can be treated as components), will
> also handle text data quite efficiently (and huge speedups are to be
> expected soon :). There is also an example here:
>
>
> http://scikit-learn.org/dev/auto_examples/document_clustering.html#example-document-clustering-py
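>
> For instance (sketch only; I believe the number-of-clusters parameter
> is still named k in the current release, adjust if it has been renamed):
>
>     from sklearn.cluster import MiniBatchKMeans
>
>     # cluster the documents; the cluster centers live in term space and
>     # can be treated as 20 "components"
>     km = MiniBatchKMeans(k=20)
>     km.fit(X)
>     components = km.cluster_centers_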
>
> A note for the scikit-learn developers:
>
> => we should definitely improve the input-checking tooling and emit
> informative ValueError messages that state explicitly that scipy.sparse
> matrices are not supported as input by the models the poster mentioned.
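>
> Something along these lines maybe (sketch only, the helper name is made
> up):
>
>     import scipy.sparse as sp
>
>     def assert_dense_input(X, estimator_name):
>         """Raise an informative error if X is a scipy.sparse matrix."""
>         if sp.issparse(X):
>             raise ValueError(
>                 "%s does not support scipy.sparse input; densify with "
>                 "X.todense() or use a sparse-aware model such as "
>                 "RandomizedPCA." % estimator_name)
>         return X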
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel