2011/10/26 Gael Varoquaux <[email protected]>:
> On Wed, Oct 26, 2011 at 03:02:28PM +0200, SK Sn wrote:
>> Hi there, I am trying to apply and test several dimension reduction methods
>> on 20Newsgroup data. However, I got errors, which I did not get how, on all
>> of them except RandomPCA. Would you please help me to get a better
>> understand of the issue?
>
>> X = Vectorizer(max_features=10000).fit_transform(data_set.data)
>
> I think that your problem is that X (returned by the Vectorizer) is a
> sparse matrix, and that the different methods other than the
> RandomizedPCA do not accept sparse matrices as inputs.
>
> You can make the data dense using
> X = X.todense()
>
> This will consume much more memory, and might not be an option, though.
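
To give a rough idea of the memory cost of densifying, here is a
back-of-the-envelope sketch in plain Python. The document count and the
average number of non-zero entries per document are made-up illustrative
values, not measurements on 20 newsgroups, and the helper names
dense_bytes and csr_bytes are hypothetical. The CSR estimate assumes
8-byte values and 4-byte indices.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed for a dense n_rows x n_cols array of itemsize-byte values."""
    return n_rows * n_cols * itemsize


def csr_bytes(nnz, n_rows, itemsize=8, index_size=4):
    """Approximate bytes for a CSR matrix.

    CSR stores one value and one column index per non-zero entry,
    plus one row pointer per row (and one extra at the end).
    """
    return nnz * itemsize + nnz * index_size + (n_rows + 1) * index_size


if __name__ == "__main__":
    # Illustrative assumptions only, not measured on real data.
    n_docs = 10000        # hypothetical number of documents
    n_features = 10000    # matches max_features=10000 in the quoted snippet
    nnz_per_doc = 100     # hypothetical average of distinct tokens per document

    dense = dense_bytes(n_docs, n_features)
    sparse = csr_bytes(n_docs * nnz_per_doc, n_docs)

    print("dense : %.1f MB" % (dense / 1e6))
    print("sparse: %.1f MB" % (sparse / 1e6))
```

With these made-up numbers the dense copy needs about 800 MB against
roughly 12 MB in CSR form, and the gap widens as the vocabulary grows.
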

Indeed, that will blow up with the default vectorizer configuration.

You can pass max_features=1000 or 10000 to the Vectorizer to restrict
the extracted features to the most frequent tokens (you can combine
this parameter with max_df=0.95 to get rid of the stop words as well)
and then call todense() on the output of the vectorizer to copy the
result into a dense numpy matrix.

Beware that most "dense-input" algorithms will probably be very slow in
high dimensions though. Right now RandomizedPCA is probably the only
decomposition method in scikit-learn that can scale to both high
n_samples and high n_features (especially with sparse features, as with
text data).

You can also try sklearn.decomposition.NMF, which accepts sparse
matrices as input directly but will be slower than RandomizedPCA. There
is an example here:

http://scikit-learn.org/dev/auto_examples/applications/topics_extraction_with_nmf.html#example-applications-topics-extraction-with-nmf-py

MiniBatchKMeans, although not strictly a decomposition method (it does
clustering, but the cluster centers can be treated as components), will
handle text data quite efficiently too (and huge speedups are to be
expected soon :) There is also an example here:

http://scikit-learn.org/dev/auto_examples/document_clustering.html#example-document-clustering-py

A note for the scikit-learn developers:

=> we should definitely improve the tooling for checking the input and
emit informative ValueError messages that state explicitly that
scipy.sparse matrices are not supported as input for the models
mentioned by the poster.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
