Hi,

Over the weekend I was experimenting with GradientBoostingClassifier
on a dataset with a large number of classes. I ran into an issue
during .fit(), caused by the MultinomialDeviance class code (sklearn v0.11):
----
/usr/lib/pymodules/python2.7/sklearn/ensemble/gradient_boosting.py:299: RuntimeWarning: invalid value encountered in true_divide
   return y - np.exp(pred[:, k]) / np.sum(np.exp(pred), axis=1)
----
This warning resulted in some NaN values that corrupted the final result
and led to a NaN score.

The cause of the issue is not surprising, given that the sum
of exps frequently leads to numerical underflows. I've patched
the original implementation and here is the diff (note: there are
two other points in the code that trigger a similar issue once you
fix the problem in line 299; the last one is in predict_proba()
of GradientBoostingClassifier):
---
$ diff gradient_boosting.py.original gradient_boosting.py.improved
295,296c295,296
<                       np.log(np.exp(pred).sum(axis=1)))
<
---
>                       np.logaddexp.reduce(pred, axis=1))
>
299c299
<         return y - np.exp(pred[:, k]) / np.sum(np.exp(pred), axis=1)
---
>         return y - np.nan_to_num(np.exp(pred[:, k] - np.logaddexp.reduce(pred, axis=1)))
683c683
<             proba = np.exp(score) / np.sum(np.exp(score), axis=1)[:, np.newaxis]
---
>             proba = np.nan_to_num(np.exp(score - (np.logaddexp.reduce(score, axis=1)[:, np.newaxis])))
----

As you can see, the solution is very simple: it uses np.logaddexp.reduce()
in place of the log of np.exp().sum(), plus np.nan_to_num() and a little
rearrangement. I can prepare a pull request if you are interested.
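
To make the rearrangement concrete, here is a minimal standalone sketch of
the same idea outside of sklearn. The function names softmax_naive and
softmax_logaddexp and the test arrays are hypothetical and only for
illustration; they are not part of gradient_boosting.py.

```python
import numpy as np

def softmax_naive(pred):
    # Original form: exp() can overflow to inf or underflow to 0.
    e = np.exp(pred)
    return e / np.sum(e, axis=1)[:, np.newaxis]

def softmax_logaddexp(pred):
    # Patched form: compute the log of the normaliser with logaddexp,
    # subtract it in log space, and only then exponentiate.
    log_norm = np.logaddexp.reduce(pred, axis=1)[:, np.newaxis]
    return np.nan_to_num(np.exp(pred - log_norm))

# Made-up scores for illustration.
moderate = np.array([[0.5, -1.0, 2.0], [0.0, 0.0, 0.0]])
extreme = np.array([[1000.0, 999.0, 998.0], [-1000.0, -1001.0, -1002.0]])

# On well-scaled scores both forms agree.
assert np.allclose(softmax_naive(moderate), softmax_logaddexp(moderate))

# On extreme scores the log-space form stays finite and each row sums to 1.
stable = softmax_logaddexp(extreme)
assert np.all(np.isfinite(stable))
assert np.allclose(stable.sum(axis=1), 1.0)
```

The point of the sketch is only that both forms compute the same quantity,
while the log-space one never forms exp() of a large-magnitude score.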

With this little patch the issue disappears and GradientBoostingClassifier
gives the expected answers.

Note that I don't have a simple toy example that reproduces the issue, and
the actual dataset I'm using is large. Still, I am sure that with a
little bit of time it would be possible to write a simple example
that trips up MultinomialDeviance as described above.
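
That said, the arithmetic failure itself can be shown in isolation. The
sketch below is not a reproduction through GradientBoostingClassifier; it
only evaluates the ratio from line 299 on made-up scores (the array pred and
the index k are assumptions chosen to force the problem).

```python
import numpy as np

# Made-up scores: row 0 makes exp() overflow, row 1 makes it underflow.
pred = np.array([[800.0, 790.0, 780.0],
                 [-800.0, -790.0, -780.0]])
k = 0

# Silence the RuntimeWarnings so the script runs quietly.
with np.errstate(over="ignore", invalid="ignore"):
    # Same ratio as in line 299, without the leading "y -".
    naive = np.exp(pred[:, k]) / np.sum(np.exp(pred), axis=1)

# Row 0 evaluates inf / inf and row 1 evaluates 0 / 0, so both are NaN.
assert np.isnan(naive).all()
```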

Best,

Emanuele




_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
