2012/9/5 Ark <[email protected]>:
>     What would be the best approach to classify a large dataset with sparse
> features into multiple categories?

How large (in bytes and in which format)? What are n_samples,
n_features and n_classes?
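
For reference, here is a rough sketch of how those numbers can be read off
a scipy.sparse matrix. The helper name describe_sparse_dataset and the toy
data are made up for illustration, and it assumes the features fit in a CSR
matrix with the labels in a 1-D array:

```python
import numpy as np
import scipy.sparse as sp


def describe_sparse_dataset(X, y):
    """Report the shape and in-memory size of a sparse dataset.

    Hypothetical helper for illustration, not part of scikit-learn.
    """
    X = sp.csr_matrix(X)  # CSR stores three arrays: data, indices, indptr
    n_samples, n_features = X.shape
    n_classes = np.unique(y).shape[0]
    # Memory actually used by the CSR representation, in bytes.
    nbytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
    return {
        "n_samples": n_samples,
        "n_features": n_features,
        "n_classes": n_classes,
        "nnz": X.nnz,
        "nbytes": nbytes,
    }


# Toy data standing in for the real dataset (an assumption of this sketch).
rng = np.random.RandomState(0)
X = sp.random(1000, 5000, density=0.01, format="csr", random_state=rng)
y = rng.randint(0, 4, size=1000)

info = describe_sparse_dataset(X, y)
print(info)
```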

> I referred to the multiclass page in the
> sklearn documentation, but was not sure on which one to use for multiclass
> probabilities [top n probabilities would be nice].
>     I tried using different classifiers but see some issues:
> SGDClassifier: get good result but see "Not Implemented" error when I use
>                predict_proba
> LinearSVC: No method to get probabilities
> LDA: get an exception "A sparse matrix was passed, but dense data is
> required. Use X.todense() to convert to dense." Converting with
> X.todense() doesn't work well.
>
> Presently I have a 64 GB limit on memory and 100 GB of disk space.
> Any suggestions?

You can try sklearn.linear_model.LogisticRegression, which is scalable,
supports sparse input and provides probability estimates.
Unfortunately it is based on liblinear, which does not use the same
memory layout as scipy.sparse matrices, so the dataset will be
duplicated in memory.
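
A minimal sketch of what that could look like, including the top-n
probabilities you asked about. The random data and the choice of top_n = 3
are placeholders, and the default solver and the max_iter argument depend
on the scikit-learn version installed, so treat this as an outline rather
than a tuned setup:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LogisticRegression

# Toy sparse data standing in for the real dataset (assumption).
rng = np.random.RandomState(0)
X = sp.random(300, 1000, density=0.02, format="csr", random_state=rng)
y = rng.randint(0, 5, size=300)

# Sparse input is accepted directly; no call to X.todense() is needed.
# max_iter is raised only to avoid convergence warnings on recent versions.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# One row per sample, one column per class (ordered as clf.classes_).
proba = clf.predict_proba(X[:10])

# Keep the top_n most probable classes per sample, most probable first.
top_n = 3
top_idx = np.argsort(proba, axis=1)[:, ::-1][:, :top_n]
rows = np.arange(proba.shape[0])[:, None]
top_proba = proba[rows, top_idx]
top_labels = clf.classes_[top_idx]

for labels, probs in zip(top_labels[:3], top_proba[:3]):
    print(list(zip(labels.tolist(), np.round(probs, 3).tolist())))
```

Column j of predict_proba corresponds to clf.classes_[j], which is why the
sorted column indices are mapped back through classes_ to get labels.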

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general