2012/9/5 Ark <[email protected]>:
> What would be the best approach to classify a large dataset with sparse
> features, into multiple categories.

How large (in bytes and in which format)? What are n_samples, n_features
and n_classes?

> I referred to the multiclass page in the
> sklearn documentation, but was not sure on which one to use for multiclass
> probabilities [top n probabilities would be nice].
> I tried using different classifiers but see some issues:
> SGDClassifier: get good result but see "Not Implemented" error when I use
> predict_proba
> LinearSVC: No method to get probabilities
> LDA: get an exception "A sparse matrix was passed, but dense data is
> required.
> Use X.todense() to convert to dense." upon doing which doesn't work well.
>
> Presently I have 64g limitations on memory and 100g disc space.
> Any suggestions?

You can try sklearn.linear_model.LogisticRegression, which is scalable and
supports both sparse input and probability estimates. Unfortunately it is
based on liblinear, which does not use the same memory layout as
scipy.sparse matrices, so the dataset will be duplicated in memory.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
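
To make the LogisticRegression suggestion concrete, here is a minimal sketch of fitting it on a scipy.sparse matrix and reading off the top n class probabilities per sample. The data is a small synthetic stand-in: the sizes, the random labels and top_n = 3 are all assumptions for illustration, not the poster's dataset. The solver is left at whatever the installed scikit-learn version defaults to, which may or may not be liblinear.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical sizes, chosen only to run quickly).
rng = np.random.RandomState(0)
n_samples, n_features, n_classes = 300, 1000, 5
X = sp.random(n_samples, n_features, density=0.01, format="csr",
              random_state=rng)
y = np.arange(n_samples) % n_classes  # every class appears at least once

# Sparse CSR input is passed directly, with no call to todense().
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# One row per sample, one column per class (ordered as in clf.classes_).
proba = clf.predict_proba(X[:10])

# Top-n classes per sample, from most to least probable.
top_n = 3
top_idx = np.argsort(proba, axis=1)[:, ::-1][:, :top_n]
top_classes = clf.classes_[top_idx]
top_proba = np.take_along_axis(proba, top_idx, axis=1)

for classes, probs in zip(top_classes[:3], top_proba[:3]):
    print(list(zip(classes.tolist(), np.round(probs, 3).tolist())))
```

On a real dataset the memory duplication mentioned above still applies when the liblinear solver is used, so it is worth checking peak memory on a subsample before fitting on everything.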
