I'd love to submit a patch.

Looking at SGDClassifier docs, the __init__ doesn't take a classes
parameter, but instead there's a partial_fit() that takes `classes`
exactly like I'd expect. However, the docs for partial_fit() are
exactly the same as for fit().

If you examine the code, fit() "warms up" the optimization with some
additional parameters, then calls _partial_fit().  partial_fit() just
calls _partial_fit() directly. So, it looks like fit() and
partial_fit() could take a `classes` parameter for SGDClassifier,
rather than __init__. It seems a bit confused, actually, since
SGDClassifier's __init__ takes a class_weight dict for doing
cost-sensitive learning but then partial_fit() takes a classes
vector--what if they contradict each other?
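
For reference, here is a minimal sketch of how partial_fit() with
`classes` behaves, assuming a reasonably recent scikit-learn. The toy
data and the label set are made up for illustration; only three of the
five labels actually occur in y.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Full label set, declared up front (illustrative values).
all_classes = np.array([-2, -1, 0, 1, 2])

# Toy data: this batch only ever contains the labels -1, 0 and 1.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = rng.choice([-1, 0, 1], size=20)

clf = SGDClassifier(random_state=0)
# `classes` is required on the first call to partial_fit().
clf.partial_fit(X, y, classes=all_classes)

# classes_ holds all five labels, including the unseen -2 and 2,
# so the output columns line up across models trained this way.
print(clf.classes_)
print(clf.decision_function(X[:2]).shape)  # (2, 5)
```

Note that the class list lives on the partial_fit() call here, while
class_weight lives on __init__, which is the split I'm questioning.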

It seems like the `class_weight` parameter in __init__ could be either
a vector or a dict, where a vector would weight all classes equally and
the dict would hold the weights for cost-sensitive learning. Then,
take the classes parameter out of partial_fit(). If a class list was
supplied in __init__ and the y vector ever has a class not in it,
you'd throw an error. Then do the same for DecisionTreeClassifier.
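
To make the proposal concrete, here is a rough sketch of the behavior
I have in mind. The helper names resolve_classes() and check_labels()
are hypothetical and not part of scikit-learn; this only illustrates
the idea, not how it would actually be wired into the estimators.

```python
def resolve_classes(class_weight):
    """Hypothetical helper: derive (classes, weights) from class_weight.

    A dict maps each class to its weight; any other sequence is read
    as a plain list of classes that are all weighted equally.
    """
    if isinstance(class_weight, dict):
        classes = sorted(class_weight)
        weights = [class_weight[c] for c in classes]
    else:
        classes = sorted(set(class_weight))
        weights = [1.0] * len(classes)
    return classes, weights


def check_labels(y, classes):
    """Hypothetical helper: reject labels in y that were not declared."""
    unknown = sorted(set(y) - set(classes))
    if unknown:
        raise ValueError("y contains labels not in classes: %r" % unknown)


# Vector form: all classes weighted equally.
classes, weights = resolve_classes([-2, -1, 0, 1, 2])
check_labels([-1, 0, 1, 1], classes)   # fine, a subset of the classes

# Dict form: weights for cost-sensitive learning.
classes, weights = resolve_classes({0: 1.0, 1: 5.0})
try:
    check_labels([0, 1, 2], classes)   # 2 was never declared
except ValueError as exc:
    print(exc)
```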

What do you think?

Doug

On Tue, Sep 25, 2012 at 11:22 AM, Lars Buitinck <[email protected]> wrote:
> 2012/9/25 Doug Coleman <[email protected]>:
>> label. So to merge predictions from the trees, now I have to do
>> bookkeeping to remember which trees had which labels in them, and it's
>> a mess.
>
> You did discover the classes_ attribute, did you? That keeps track of
> the classes found in y by fit and solves part of the bookkeeping
> problem.
>
>> Someone suggested I use sklearn.feature_extraction.DictVectorizer, but
>> that seems to be to track the X matrix instead of y. What I might end
>> up doing is unique/sorting the y labels for each tree, calling
>> predict_proba on each, adding column vectors of zeros to the
>> predictions, and then merging the results.
>
> No, that's not what DictVectorizer is for. I guess it *could* be used
> for tracking labels and probabilities, if you fit it on the trivial
> "dataset"
>
> [dict((str(label),0) for label in [-2, -1, 0, 1, 2])]
>
> but then still, you have to convert from integers to strings all the time.
>
>> What I would prefer to do is call fit with a set of possible labels,
>> like so: clf.fit(X, y, labels=[-2,-1,0,1,2]) so scikit could do the
>> bookkeeping for me. Obviously some of the trees in my ensemble would
>> be useless at predicting the -2 or 2 labels, but that's expected.
>
> That would be nice. I think we actually put that argument on __init__
> where appropriate (SGDClassifier) and call it classes, not labels.
> Would you perhaps be willing to implement this for decision trees and
> submit a pull request?
>
>> Maybe people don't usually use the library in this way so it doesn't come up?
>
> It only comes up in advanced use cases such as online learning, so
> some estimators have this, but others don't.
>
> --
> Lars Buitinck
> Scientific programmer, ILPS
> University of Amsterdam
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
