I'd love to submit a patch. Looking at the SGDClassifier docs, __init__ doesn't take a classes parameter; instead there's a partial_fit() that takes `classes` exactly like I'd expect. However, the docs for partial_fit() are exactly the same as for fit().
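To make the partial_fit() behaviour concrete, here is a minimal sketch. It assumes a scikit-learn version where SGDClassifier.partial_fit() accepts `classes`; the data, the label set, and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Full set of labels the model should know about, even if a given
# batch happens to contain only some of them (illustrative values).
all_classes = np.array([-2, -1, 0, 1, 2])

rng = np.random.RandomState(0)
X_batch = rng.randn(20, 3)
# This batch only contains three of the five possible labels.
y_batch = rng.choice([-1, 0, 1], size=20)

clf = SGDClassifier(random_state=0)
# `classes` has to be supplied on the first call to partial_fit().
clf.partial_fit(X_batch, y_batch, classes=all_classes)

# classes_ now lists every label passed in, not just those seen in y_batch,
# and decision_function() returns one column per label.
print(clf.classes_)
print(clf.decision_function(X_batch).shape)
```

The point is that the column layout of the output is fixed by `classes`, not by whichever labels show up in a particular batch.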
If you examine the code, fit() "warms up" the optimization with some additional parameters, then calls _partial_fit(). partial_fit() just calls _partial_fit() directly. So it looks like fit() and partial_fit() could both take a `classes` parameter for SGDClassifier, rather than __init__.

It seems a bit confusing, actually, since SGDClassifier's __init__ takes a class_weight dict for doing cost-sensitive learning, but then partial_fit() takes a classes vector. What if they contradict each other?

It seems like the `class_weight` parameter in __init__ could be either a vector or a dict, where a vector would treat all weights equally and the dict would have the weights for cost-sensitive learning. Then take the classes parameter out of partial_fit(). If the y vector ever has a class that is not in the classes vector, and one was supplied in __init__, then you'd throw an error.

Then do the same for DecisionTreeClassifier.

What do you think?

Doug

On Tue, Sep 25, 2012 at 11:22 AM, Lars Buitinck <[email protected]> wrote:
> 2012/9/25 Doug Coleman <[email protected]>:
>> label. So to merge predictions from the trees, now I have to do
>> bookkeeping to remember which trees had which labels in them, and it's
>> a mess.
>
> You did discover the classes_ attribute, did you? That keeps track of
> the classes found in y by fit and solves part of the bookkeeping
> problem.
>
>> Someone suggested I use sklearn.feature_extraction.DictVectorizer, but
>> that seems to be to track the X matrix instead of y. What I might end
>> up doing is unique/sorting the y labels for each tree, calling
>> predict_proba on each, adding column vectors of zeros to the
>> predictions, and then merging the results.
>
> No, that's not what DictVectorizer is for. I guess it *could* be used
> for tracking labels and probabilities, if you fit it on the trivial
> "dataset"
>
>     [dict((str(label),0) for label in [-2, -1, 0, 1, 2])]
>
> but then still, you have to convert from integers to strings all the time.
>
>> What I would prefer to do is call fit with a set of possible labels,
>> like so: clf.fit(X, y, labels=[-2,1,0,1,2]) so scikit could do the
>> bookkeeping for me. Obviously some of the trees in my ensemble would
>> be useless at predicting the -2 or 2 labels, but that's expected.
>
> That would be nice. I think we actually put that argument on __init__
> where appropriate (SGDClassifier) and call it classes, not labels.
> Would you perhaps be willing to implement this for decision trees and
> submit a pull request?
>
>> Maybe people don't usually use the library in this way so it doesn't come up?
>
> It only comes up in advanced use cases such as online learning, so
> some estimators have this, but others don't.
>
> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
