While the Gaussian distribution has a PDF, the Poisson distribution has a PMF. From Wikipedia (https://en.wikipedia.org/wiki/Probability_mass_function):

"A probability mass function differs from a probability density function (pdf) in that the latter is associated with continuous rather than discrete random variables; the values of the latter are not probabilities as such: a pdf must be integrated over an interval to yield a probability".

So in the case of Poisson regression, p(y|x) is a true probability. For this reason, I tend to prefer predict_proba or predict_proba_at for Poisson regression. Another argument in favor of predict_proba or predict_proba_at is that it is a conditional probability, the same as for classifiers.

Incidentally, `git grep` tells me that we have apparently never used score_samples(X, y), only score_samples(X). So if we overload the meaning of score_samples, we might as well overload the meaning of predict_proba.

Also, while X is [n_samples, n_features], in the use case I was thinking of, y wouldn't be [n_samples]:

- it would either be a scalar, in which case the output of the method would be an array of shape [n_samples] containing p(y|x_i) for all x_i;
- or y would be an array of shape [n_values], in which case the output would be [n_samples, n_values].

In other words, we can query the probabilities of all x_i at different values of y. The values must be integers in the case of Poisson regression (see the sketch below).

BTW, I raised the issue that using predict_proba in a regressor might be problematic, since we (used to) use predict_proba to detect classifiers. Is it really an issue, now that we have the tag system and is_classifier()?
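A minimal sketch of the proposed shape semantics, not an existing scikit-learn API: predict_proba_at is the hypothetical helper under discussion, and model.predict(X) is assumed to return the conditional mean E(y|x_i) of a fitted Poisson regressor.

    import numpy as np
    from scipy.stats import poisson

    def predict_proba_at(model, X, y):
        # Hypothetical helper: p(y | x_i) under a Poisson model whose
        # predict(X) returns the conditional mean E(y|x_i).
        mu = model.predict(X)            # shape [n_samples]
        y = np.asarray(y)
        if y.ndim == 0:
            # scalar y -> p(y | x_i) for every sample, shape [n_samples]
            return poisson.pmf(y, mu)
        # y of shape [n_values] -> output of shape [n_samples, n_values],
        # one row per sample, one column per queried integer y value
        return poisson.pmf(y[np.newaxis, :], mu[:, np.newaxis])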
"A probability mass function differs from a probability density function (pdf) in that the latter is associated with continuous rather than discrete random variables; the values of the latter are not probabilities as such: a pdf must be integrated over an interval to yield a probability" So in the case of Poisson regression, p(y|x) is a true probability. For this reason, I tend to prefer predict_proba or predict_proba_at for Poisson regression. Another argument in favor of predict_proba or predict_proba_at is that it is a conditional probability, the same as for classifiers. Incidentally, `git grep` tells met that we have apparently never used score_samples(X, y), only score_samples(X). So if we overload the meaning of score_samples, we might as well overload the meaning of predict_proba. Also, while X is [n_samples, n_features], in the use case I was thinking, y wouldn't be [n_samples]: - It would be either a scalar and in this case the output of the method would be an array of shape [n_samples] containing p(y|x_i) for all x_i - Or y would be an array of shape [n_values], in which case the output would be [n_samples, n_values]. In other words, we can query the probabilities of all x_i at different values of y. The values must be integers in the case of Poisson regression. BTW, I raised the issue that using predict_proba in a regressor might be problematic, since we (used to) use predict_proba to dectect classifiers. Is it really an issue, now that we have the tag system and is_classifier()? On Thu, Jul 30, 2015 at 9:11 PM, Jan Hendrik Metzen < j...@informatik.uni-bremen.de> wrote: > That's true, I wasn't aware that score_samples is used already in the > context of density estimation. score_samples would be okay then in my > opinion. > > Jan > > On 29.07.2015 18:46, Andreas Mueller wrote: > > Hm, I'm not entirely sure how score_samples is currently used, but I > > think it is the probability > > under a density model. > > It would "only" change the meaning in so far as it is a conditional > > distribution over y given x and not x. > > > > I'm not totally opposed to adding a new method, though I'm not sure I > > like ``predict_proba_at`` > > > > On 07/29/2015 12:29 PM, Jan Hendrik Metzen wrote: > >> I am not sure about the name, score_samples would sound a bit strange > >> for a conditional probability in my opinion. And likelihood is also > >> misleading since its actually a conditional probability and not a > >> conditional likelihood (the quantities on the right-hand side of > >> conditioning are fixed and integrating over all y would be 1). > >> > >> On 29.07.2015 16:16, Andreas Mueller wrote: > >>> Shouldn't that be "score_samples"? > >>> Well, it is a conditional likelihood p(y|x), not p(x) or p(x, y). > >>> But it is the likelihood of some data given the model. > >>> > >>> > >>> On 07/29/2015 02:58 AM, Jan Hendrik Metzen wrote: > >>>> Such a predict_proba_at() method would also make sense for Gaussian > >>>> process regression. Currently, computing probability densities for GPs > >>>> requires predicting mean and standard deviation (via "MSE") at X and > >>>> using scipy.stats.norm.pdf to compute probability densities for y for > >>>> the predicted mean and standard-deviation. I think it would be nice to > >>>> allow this directily via the API. Thus +1 for adding a method like > >>>> predict_proba_at(). > >>>> > >>>> Jan > >>>> > >>>> On 29.07.2015 06:42, Mathieu Blondel wrote: > >>>>> Regarding predictions, I don't really see what's the problem. 
> >>>> On 29.07.2015 06:42, Mathieu Blondel wrote:
> >>>>> Regarding predictions, I don't really see what's the problem. Using
> >>>>> GLMs as an example, you just need to do
> >>>>>
> >>>>>     def predict(self, X):
> >>>>>         if self.loss == "poisson":
> >>>>>             return np.exp(np.dot(X, self.coef_))
> >>>>>         else:
> >>>>>             return np.dot(X, self.coef_)
> >>>>>
> >>>>> A nice thing about Poisson regression is that we can query the
> >>>>> probability p(y|x) for a specific integer y.
> >>>>> https://en.wikipedia.org/wiki/Poisson_regression
> >>>>>
> >>>>> We need to decide an API for that (so far we have used predict_proba
> >>>>> for classification, so the output was always n_samples x n_classes).
> >>>>> How about predict_proba(X, at_y=some_integer)?
> >>>>>
> >>>>> However, this also means that we can't use predict_proba to detect
> >>>>> classifiers anymore...
> >>>>> Another solution would be to introduce a new method
> >>>>> predict_proba_at(X, y=some_integer)...
> >>>>>
> >>>>> Mathieu
> >>>>>
> >>>>> On Wed, Jul 29, 2015 at 4:19 AM, Andreas Mueller
> >>>>> <t3k...@gmail.com> wrote:
> >>>>>
> >>>>>     I was expecting there to be the actual poisson loss implemented
> >>>>>     in the class, not just a log transform.
> >>>>>
> >>>>>     On 07/28/2015 02:03 PM, josef.p...@gmail.com wrote:
> >>>>>>     Just a comment from the statistics sidelines:
> >>>>>>
> >>>>>>     taking the log of the target and fitting a linear or other
> >>>>>>     model doesn't turn it into a Poisson model.
> >>>>>>
> >>>>>>     But maybe "Poisson loss" in machine learning is unrelated to
> >>>>>>     the Poisson distribution, or to a Poisson model with
> >>>>>>     E(y|x) = exp(x beta)?
> >>>>>>
> >>>>>>     Josef
> >>>>>>
> >>>>>>     On Tue, Jul 28, 2015 at 2:46 PM, Andreas Mueller
> >>>>>>     <t3k...@gmail.com> wrote:
> >>>>>>
> >>>>>>         I'd be happy with adding Poisson loss to more models,
> >>>>>>         though I think it would be more natural to first add it to
> >>>>>>         GLM before GBM ;) If the addition is straightforward, I
> >>>>>>         think it would be a nice contribution nevertheless.
> >>>>>>
> >>>>>>         1) For the user to do np.exp(gbmpoisson.predict(X)) is not
> >>>>>>         acceptable. This needs to be automatic. It would be best
> >>>>>>         if this could be done in a minimally intrusive way.
> >>>>>>
> >>>>>>         2) I'm not sure, maybe Peter can comment?
> >>>>>>
> >>>>>>         3) I would rather contribute sooner, but others might
> >>>>>>         think differently. Silently ignoring sample weights is not
> >>>>>>         an option, but you can error if they are provided.
> >>>>>>
> >>>>>>         Hth,
> >>>>>>         Andy
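A minimal sketch of the guard Andy suggests in (3), erroring rather than silently ignoring sample weights; the class, attribute, and message below are illustrative only, not scikit-learn code:

    class PoissonGBMSketch:
        # Hypothetical estimator skeleton, for illustration only.
        def __init__(self, loss="poisson"):
            self.loss = loss

        def fit(self, X, y, sample_weight=None):
            # Refuse, rather than silently ignore, sample weights while
            # they are unsupported for the Poisson loss.
            if sample_weight is not None and self.loss == "poisson":
                raise NotImplementedError(
                    "sample_weight is not yet supported with "
                    "loss='poisson'")
            # ... actual fitting would go here ...
            return self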
> >>>>>> On 07/23/2015 08:52 PM, Peter Rickwood wrote:
> >>>>>>> Hello sklearn developers,
> >>>>>>>
> >>>>>>> I'd like the GBM implementation in sklearn to support Poisson
> >>>>>>> loss, and I'm comfortable writing the code (I have modified my
> >>>>>>> local sklearn source already and am using Poisson loss GBMs).
> >>>>>>>
> >>>>>>> The sklearn site says to get in touch via this list before making
> >>>>>>> a contribution, so is it worth me submitting something along
> >>>>>>> these lines?
> >>>>>>>
> >>>>>>> If the answer is yes, some quick questions:
> >>>>>>>
> >>>>>>> 1) The simplest implementation of Poisson loss GBMs is to work in
> >>>>>>> log-space (i.e. the GBM predicts log(target) rather than target)
> >>>>>>> and require the user to then take the exponential of those
> >>>>>>> predictions. So you would need to do something like:
> >>>>>>>
> >>>>>>>     gbmpoisson = sklearn.ensemble.GradientBoostingRegressor(...)
> >>>>>>>     gbmpoisson.fit(X, y)
> >>>>>>>     preds = np.exp(gbmpoisson.predict(X))
> >>>>>>>
> >>>>>>> I am comfortable making changes to the source for this to work,
> >>>>>>> but I'm not comfortable changing any of the higher-level
> >>>>>>> interface to deal automatically with the transform. In other
> >>>>>>> words, other developers would need to either be OK with the GBM
> >>>>>>> returning transformed predictions in the case where "poisson"
> >>>>>>> loss is chosen, or would need to change code in the 'predict'
> >>>>>>> function to automatically do the transformation if Poisson loss
> >>>>>>> was specified. Is this OK?
> >>>>>>>
> >>>>>>> 2) If I do contribute, can you advise what the best tests are to
> >>>>>>> test/validate GBM loss functions before they are considered to
> >>>>>>> 'work'?
> >>>>>>>
> >>>>>>> 3) Allowing for weighted samples is in theory easy enough to
> >>>>>>> implement, but is not something I have implemented yet. Is it
> >>>>>>> better to contribute code sooner that doesn't handle weighting
> >>>>>>> (i.e. just ignores sample weights), or later code that does?
> >>>>>>>
> >>>>>>> Cheers, and thanks for all your work on sklearn. Fantastic
> >>>>>>> tool/library,
> >>>>>>>
> >>>>>>> Peter
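For reference, a minimal sketch of the "actual poisson loss" being asked for, under the log-space convention Peter describes (the model predicts raw_pred = log(mu), so mu = exp(raw_pred) is the predicted mean E(y|x)); these are illustrative helper functions, not scikit-learn's internal loss API:

    import numpy as np

    def poisson_loss(y, raw_pred):
        # Negative Poisson log-likelihood, up to the constant log(y!):
        # mu - y * log(mu), with mu = exp(raw_pred).
        return np.mean(np.exp(raw_pred) - y * raw_pred)

    def poisson_negative_gradient(y, raw_pred):
        # Pseudo-residuals each boosting stage would fit:
        # -d(loss)/d(raw_pred) = y - exp(raw_pred) = y - mu
        return y - np.exp(raw_pred)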