Hi, I'm making an ensemble of trees by hand for classification and trying to merge their outputs with predict_proba. My labels are integers -2..2. The problem is that -2 and 2 are rare labels. Now assume that I train these trees on different but related data sets, some of which don't even contain -2 or 2. The shape of the predict_proba output depends on the number of unique labels in the y passed to fit, so instead of always getting 5 columns, I only get a column for each label that particular tree saw. So to merge predictions from the trees, I now have to do bookkeeping to remember which trees had which labels in them, and it's a mess.
Someone suggested I use sklearn.feature_extraction.DictVectorizer, but that seems to be meant for tracking the X matrix rather than y. What I might end up doing is taking the sorted unique y labels for each tree, calling predict_proba on each, adding column vectors of zeros to the predictions for the missing labels, and then merging the results.

What I would prefer to do is call fit with the set of possible labels, like so: clf.fit(X, y, labels=[-2,-1,0,1,2]) so scikit could do the bookkeeping for me. Obviously some of the trees in my ensemble would be useless at predicting the -2 or 2 labels, but that's expected.

An analogous example is randomly selecting and training on rows where the y values are not all represented. This is taken care of for DecisionTreeClassifiers by the max_features='auto' parameter already, internally. Maybe people don't usually use the library in this way so it doesn't come up?

Thanks,
Doug
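P.S. To make the manual workaround concrete, here is a rough sketch of the zero-padding approach. It assumes each fitted tree exposes the sorted labels it saw as clf.classes_, and that every tree's labels are a subset of a known, sorted full label set. The names ALL_LABELS, padded_proba and ensemble_proba are hypothetical helpers of mine, not part of scikit-learn, and averaging is just one possible way to merge.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Full label set, sorted ascending (assumed known up front).
ALL_LABELS = np.array([-2, -1, 0, 1, 2])


def padded_proba(clf, X, all_labels=ALL_LABELS):
    """Return clf.predict_proba(X) expanded to one column per label in all_labels.

    Columns for labels the classifier never saw during fit are left at zero.
    Assumes all_labels is sorted and clf.classes_ is a subset of it.
    """
    proba = clf.predict_proba(X)
    out = np.zeros((proba.shape[0], len(all_labels)))
    # Position of each label this tree saw within the full label set.
    cols = np.searchsorted(all_labels, clf.classes_)
    out[:, cols] = proba
    return out


def ensemble_proba(clfs, X, all_labels=ALL_LABELS):
    """Merge the trees by averaging their padded probabilities."""
    return np.mean([padded_proba(c, X, all_labels) for c in clfs], axis=0)


# Toy data: the second tree never sees the rare labels -2 and 2.
X = np.arange(40, dtype=float).reshape(20, 2)
tree_full = DecisionTreeClassifier(random_state=0).fit(X, np.tile([-2, -1, 0, 1, 2], 4))
tree_partial = DecisionTreeClassifier(random_state=0).fit(X, np.tile([-1, 0, 1, 1], 5))

merged = ensemble_proba([tree_full, tree_partial], X)
print(merged.shape)
```

With this, every tree contributes a 5-column array regardless of which labels it was trained on, so the per-tree bookkeeping is confined to one helper.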
