On Tue, Sep 25, 2012 at 10:31:10AM -0700, Doug Coleman wrote:
> I'm making an ensemble of trees by hand for classification and trying
> to merge their outputs with predict_proba. My labels are integers
> -2..2. The problem is that -2 and 2 are rare labels. Now assume that I
> train these trees with different but related data sets, some of which
> don't even contain -2 or 2. The shape of predict_proba varies based on
> number of unique labels in the input y, so instead of always getting 5
> columns in predict_proba, I only get columns wherever there was a
> label.

I hate to say it, but you are starting from a really difficult position
for learning. So far we do not have tools to work with very rare output
classes. I think that such situations take a lot of care to get good
results.

For this reason, my own personal opinion is that I wouldn't favor having
a 'quick fix' landing in the scikit that wouldn't solve the core
statistical problems. I understand that the bookkeeping is tedious, but my
gut feeling is that solving it will just make other problems appear.
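
For what it's worth, the bookkeeping on its own can be handled outside the
estimators. A minimal sketch, assuming each fitted tree exposes the labels
it saw as `classes_` and that the columns of `predict_proba` follow that
order; `expand_proba` and `ALL_LABELS` are hypothetical names used for
illustration, not part of the scikit:

```python
import numpy as np

# Full label set for the problem (taken from the question: integers -2..2).
ALL_LABELS = (-2, -1, 0, 1, 2)


def expand_proba(proba, classes, all_labels=ALL_LABELS):
    """Pad `proba` with zero columns so it has one column per label.

    `proba` has one column per entry of `classes`, in the same order.
    """
    proba = np.asarray(proba, dtype=float)
    column_of = {label: i for i, label in enumerate(all_labels)}
    # Raises KeyError if a classifier saw a label outside `all_labels`.
    cols = [column_of[c] for c in np.asarray(classes).tolist()]
    full = np.zeros((proba.shape[0], len(all_labels)))
    full[:, cols] = proba
    return full


# Two hand-written outputs standing in for two trees' predict_proba:
# the first tree never saw -2 or 2, the second saw all five labels.
p1 = expand_proba([[0.2, 0.5, 0.3]], classes=[-1, 0, 1])
p2 = expand_proba([[0.1, 0.1, 0.6, 0.1, 0.1]], classes=[-2, -1, 0, 1, 2])

# Both arrays now have 5 columns, so they can be averaged directly.
merged = np.mean([p1, p2], axis=0)
```

With a fitted tree this would be called as
`expand_proba(tree.predict_proba(X), tree.classes_)`. Note that a tree which
never saw -2 or 2 always contributes zero probability to those columns, so
the average is biased against the rare labels; the padding does nothing
about that.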

By the way, have you considered drawing 'stratified', or balanced,
bootstraps, in which you would keep the class ratios constant? This would
help with the bookkeeping, but might also help with the statistical
learning problem.
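
A stratified bootstrap can be sketched as resampling with replacement
within each class separately, so that every resample keeps the class counts
of the original `y`. This is a rough illustration in plain NumPy;
`stratified_bootstrap_indices` is a hypothetical helper name, and the sketch
assumes NumPy 1.17 or later for `default_rng`:

```python
import numpy as np


def stratified_bootstrap_indices(y, rng):
    """Return bootstrap indices that preserve the per-class counts of `y`."""
    y = np.asarray(y)
    parts = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        # Resample with replacement inside this class only.
        parts.append(rng.choice(members, size=members.size, replace=True))
    idx = np.concatenate(parts)
    rng.shuffle(idx)  # mix the classes back together
    return idx


# Toy labels in which -2 and 2 are rare, as in the question.
y = np.array([-2, -1, -1, 0, 0, 0, 0, 1, 1, 2])
rng = np.random.default_rng(0)
idx = stratified_bootstrap_indices(y, rng)
```

Each tree trained on `X[idx], y[idx]` then sees every label present in the
data set it was drawn from. For a data set that lacks -2 or 2 entirely,
resampling cannot help.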

Thanks for offering a patch, though, it is much appreciated,

Gaël

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
