I'm not necessarily looking for a quick fix here, and anything I would consider contributing to scikit would need to be useful and correct.
You're right that there's not a good chance it can learn to predict with sparse output classes, but if the problem were easy, then I wouldn't need scikit at all. I just wanted to try out an idea and the API is kind of getting in the way.

If the output labels were not collected from the y vector but instead provided as a parameter that tells the classifier what I'm looking for independently, as SGDClassifier supports, then that would solve the problem. Maybe the right thing to do is open an issue on GitHub about the discrepancy in the API and either hope someone else wants to fix it or submit patches myself eventually.

Just out of curiosity, what problems do you think could arise from this, other than the machine learning effort ultimately failing because of sparsity?

Thanks,
Doug

On Tue, Sep 25, 2012 at 1:57 PM, Gael Varoquaux <[email protected]> wrote:
> On Tue, Sep 25, 2012 at 10:31:10AM -0700, Doug Coleman wrote:
>> I'm making an ensemble of trees by hand for classification and trying
>> to merge their outputs with predict_proba. My labels are integers
>> -2..2. The problem is that -2 and 2 are rare labels. Now assume that I
>> train these trees with different but related data sets, some of which
>> don't even contain -2 or 2. The shape of predict_proba varies based on
>> number of unique labels in the input y, so instead of always getting 5
>> columns in predict_proba, I only get columns wherever there was a
>> label.
>
> I hate to say, but you are starting in a really difficult position for
> learning. So far we do not have tools to work with very sparse output
> classes. I think that such situations take a lot of care to get good
> results.
>
> For this reason, my own personal opinion is that I wouldn't favor having
> a 'quick fix' landing in the scikit that wouldn't solve the core
> statistical problems. I understand that the bookkeeping is tedious, but my
> gut feeling is that solving it will just make other problems appear.
>
> By the way, have you considered making 'stratified', or balanced
> bootstraps, in which you would keep the class ratios constant? This would
> help for bookkeeping, but might also help for the statistical learning
> problem.
>
> Thanks for offering a patch, though, it is much appreciated,
>
> Gaël
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
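
For reference, here is a minimal sketch of the bookkeeping workaround this thread is about: padding each tree's predict_proba output to one column per label in -2..2 before averaging. It is not a scikit API and does nothing about the statistical problem of rare classes. The names aligned_proba and ALL_LABELS and the toy data are hypothetical, and the sketch assumes each fitted tree exposes its sorted training labels as classes_ and that those labels are a subset of ALL_LABELS.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Full label set from the thread (hypothetical name); must be sorted.
ALL_LABELS = np.array([-2, -1, 0, 1, 2])


def aligned_proba(clf, X, all_labels=ALL_LABELS):
    """Pad clf.predict_proba(X) to one column per entry of all_labels.

    Labels the classifier never saw during fit get probability 0.
    Assumes clf.classes_ is sorted and is a subset of all_labels.
    """
    proba = clf.predict_proba(X)
    out = np.zeros((proba.shape[0], len(all_labels)))
    # Map each label seen at fit time to its column in the full label set.
    cols = np.searchsorted(all_labels, clf.classes_)
    out[:, cols] = proba
    return out


# Toy data (an assumption for illustration only).
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y_full = np.tile(ALL_LABELS, 12)        # contains all five labels
y_partial = np.tile([-1, 0, 1], 20)     # never contains -2 or 2

tree_a = DecisionTreeClassifier(random_state=0).fit(X, y_full)
tree_b = DecisionTreeClassifier(random_state=0).fit(X, y_partial)

# Both outputs now have 5 columns, so they can be averaged directly.
avg = np.mean([aligned_proba(tree_a, X), aligned_proba(tree_b, X)], axis=0)
```

Filling unseen labels with zero is only one possible choice; it means a tree trained without -2 or 2 always votes against them, which is part of why the sparsity concern raised above still applies.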
