Hi,

Over the weekend I was experimenting with GradientBoostingClassifier on a dataset with a large number of classes. I ran into a problem during ".fit()", caused by the MultinomialDeviance class code (sklearn v0.11):

----
/usr/lib/pymodules/python2.7/sklearn/ensemble/gradient_boosting.py:299: RuntimeWarning: invalid value encountered in true_divide
  return y - np.exp(pred[:, k]) / np.sum(np.exp(pred), axis=1)
----

This warning came with NaN values that corrupted the final result and led to a NaN score.
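For illustration only, the expression from line 299 can be evaluated on a small hand-made array. This is a sketch of the numerical effect, not a reproduction on the real dataset: the function name naive_softmax_column and the values in pred are made up for the example.

```python
import numpy as np


def naive_softmax_column(pred, k):
    """Same form as the expression in the warning:
    exp(pred[:, k]) / sum(exp(pred), axis=1)."""
    return np.exp(pred[:, k]) / np.sum(np.exp(pred), axis=1)


# Hypothetical raw scores for 3 samples and 3 classes (not from the real data).
pred = np.array([
    [1.0, 2.0, 3.0],              # moderate scores: result is finite
    [1000.0, 1001.0, 1002.0],     # exp() overflows to inf, inf / inf = nan
    [-1000.0, -1001.0, -1002.0],  # exp() underflows to 0, 0 / 0 = nan
])

# Silence the overflow / invalid-value warnings so only the result is printed.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax_column(pred, k=2)

print(naive)
```

Only the first row gives a usable value. The other two rows become NaN, which is the kind of value that then propagates into the score.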
The cause of the issue is not surprising, given that the sum of exps easily overflows or underflows. I've patched the original implementation and here is the diff (note: there are two other points in the code that trigger a similar issue once you fix the problem in line 299; the last one is in predict_proba() of GradientBoostingClassifier):

----
$ diff gradient_boosting.py.original gradient_boosting.py.improved
295,296c295,296
< np.log(np.exp(pred).sum(axis=1)))
<
---
> np.logaddexp.reduce(pred, axis=1))
>
299c299
< return y - np.exp(pred[:, k]) / np.sum(np.exp(pred), axis=1)
---
> return y - np.nan_to_num(np.exp(pred[:, k] -
> np.logaddexp.reduce(pred, axis=1)))
683c683
< proba = np.exp(score) / np.sum(np.exp(score), axis=1)[:, np.newaxis]
---
> proba = np.nan_to_num(np.exp(score - (np.logaddexp.reduce(score,
> axis=1)[:, np.newaxis])))
----

As you can see, the solution is very simple: it uses np.logaddexp.reduce() instead of np.exp().sum(), plus np.nan_to_num() and a little rearrangement. I can prepare a pull request if you are interested.

With this little patch the issue disappears and GradientBoostingClassifier gives the expected answers.

Note that I don't have a simple toy example that reproduces the issue, and the actual dataset I'm using is large. Still, I am sure that with a little bit of time it would be possible to write a simple example that trips up MultinomialDeviance as described above.

Best,

Emanuele
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
