On Tue, Sep 25, 2012 at 10:31:10AM -0700, Doug Coleman wrote:
> I'm making an ensemble of trees by hand for classification and trying
> to merge their outputs with predict_proba. My labels are integers
> -2..2. The problem is that -2 and 2 are rare labels. Now assume that I
> train these trees with different but related data sets, some of which
> don't even contain -2 or 2. The shape of predict_proba varies based on
> number of unique labels in the input y, so instead of always getting 5
> columns in predict_proba, I only get columns wherever there was a
> label.
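For reference, the bookkeeping described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `align_proba` and `ALL_LABELS` are hypothetical names, not part of the scikit, and `seen_labels` stands for the labels a given tree was trained on (what a fitted classifier exposes as `classes_`). Labels a tree never saw simply get probability zero.

```python
import numpy as np

# Assumed full label set for the problem (-2..2), kept sorted.
ALL_LABELS = np.array([-2, -1, 0, 1, 2])

def align_proba(proba, seen_labels, all_labels=ALL_LABELS):
    """Expand a predict_proba-style array to one column per label in all_labels.

    Hypothetical helper. Assumes all_labels is sorted and that every entry
    of seen_labels occurs in it. Unseen labels get probability 0.
    """
    proba = np.asarray(proba, dtype=float)
    full = np.zeros((proba.shape[0], len(all_labels)))
    # Column position of each seen label within the full label set.
    cols = np.searchsorted(all_labels, seen_labels)
    full[:, cols] = proba
    return full

# Made-up outputs of two trees for one sample.
p1 = np.array([[0.2, 0.5, 0.3]])            # tree trained on labels -1, 0, 1 only
p2 = np.array([[0.1, 0.1, 0.4, 0.3, 0.1]])  # tree trained on all five labels

# Average the aligned outputs; the result always has 5 columns.
merged = (align_proba(p1, [-1, 0, 1]) + align_proba(p2, ALL_LABELS)) / 2
```

Note that this only fixes the shapes. A tree that never saw -2 or 2 still assigns them zero probability, so the averaged estimate for the rare labels is biased downwards.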
I hate to say it, but you are starting from a really difficult position for learning. So far we do not have tools for working with very sparse output classes, and I think that such situations take a lot of care to get good results. For this reason, my own personal opinion is that I wouldn't favor a 'quick fix' landing in the scikit that doesn't solve the core statistical problems. I understand that the bookkeeping is tedious, but my gut feeling is that solving it will just make other problems appear.

By the way, have you considered making 'stratified', or balanced, bootstraps, in which you would keep the class ratios constant? This would help with the bookkeeping, and might also help with the statistical learning problem.

Thanks for offering a patch, though, it is much appreciated,

Gaël
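
P.S. To make the stratified bootstrap idea concrete, here is a minimal sketch, assuming the labels sit in a NumPy array `y`. The function name `stratified_bootstrap_indices` is hypothetical, not something in the scikit. It resamples with replacement inside each class, so every bootstrap sample keeps the original class counts and every label present in `y` shows up in every sample.

```python
import numpy as np

def stratified_bootstrap_indices(y, rng):
    """Return bootstrap indices that preserve the class counts of y.

    Hypothetical helper: draws with replacement separately within each
    class, then concatenates the per-class draws.
    """
    y = np.asarray(y)
    picked = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)  # positions of this class
        picked.append(rng.choice(members, size=members.size, replace=True))
    return np.concatenate(picked)

# Made-up labels where -2 and 2 are rare.
y = np.array([-2, -1, -1, 0, 0, 0, 0, 1, 1, 2])
rng = np.random.default_rng(0)
idx = stratified_bootstrap_indices(y, rng)
y_boot = y[idx]  # same class counts as y, so 5 columns from predict_proba
```

This cannot create a class that is absent from a data set altogether; it only guarantees that rare classes which are present do not drop out of a resample.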
