While the Gaussian distribution has a PDF, the Poisson distribution has a PMF. From Wikipedia (https://en.wikipedia.org/wiki/Probability_mass_function):

"A probability mass function differs from a probability density function (pdf) in that the latter is associated with continuous rather than discrete random variables; the values of the latter are not probabilities as such: a pdf must be integrated over an interval to yield a probability".

So in the case of Poisson regression, p(y|x) is a true probability. For this reason, I tend to prefer predict_proba or predict_proba_at for Poisson regression. Another argument in favor of predict_proba or predict_proba_at is that it is a conditional probability, the same as for classifiers.

Incidentally, `git grep` tells me that we have apparently never used score_samples(X, y), only score_samples(X). So if we overload the meaning of score_samples, we might as well overload the meaning of predict_proba.

Also, while X is [n_samples, n_features], in the use case I was thinking of, y wouldn't be [n_samples]:

- it would either be a scalar, in which case the output of the method would be an array of shape [n_samples] containing p(y|x_i) for all x_i;
- or y would be an array of shape [n_values], in which case the output would be [n_samples, n_values].

In other words, we can query the probabilities of all x_i at different values of y. The values must be integers in the case of Poisson regression (see the sketch below).

BTW, I raised the issue that using predict_proba in a regressor might be problematic, since we (used to) use predict_proba to detect classifiers. Is it really an issue, now that we have the tag system and is_classifier()?
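A minimal sketch of the proposed shape semantics, not an existing scikit-learn API: predict_proba_at is the hypothetical helper under discussion, and model.predict(X) is assumed to return the conditional mean E(y|x_i) of a fitted Poisson regressor.

    import numpy as np
    from scipy.stats import poisson

    def predict_proba_at(model, X, y):
        # Hypothetical helper: p(y | x_i) under a Poisson model whose
        # predict(X) returns the conditional mean E(y|x_i).
        mu = model.predict(X)            # shape [n_samples]
        y = np.asarray(y)
        if y.ndim == 0:
            # scalar y -> p(y | x_i) for every sample, shape [n_samples]
            return poisson.pmf(y, mu)
        # y of shape [n_values] -> output of shape [n_samples, n_values],
        # one row per sample, one column per queried integer y value
        return poisson.pmf(y[np.newaxis, :], mu[:, np.newaxis])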
"A probability mass function differs from a probability density function (pdf) in that the latter is associated with continuous rather than discrete random variables; the values of the latter are not probabilities as such: a pdf must be integrated over an interval to yield a probability" So in the case of Poisson regression, p(y|x) is a true probability. For this reason, I tend to prefer predict_proba or predict_proba_at for Poisson regression. Another argument in favor of predict_proba or predict_proba_at is that it is a conditional probability, the same as for classifiers. Incidentally, `git grep` tells met that we have apparently never used score_samples(X, y), only score_samples(X). So if we overload the meaning of score_samples, we might as well overload the meaning of predict_proba. Also, while X is [n_samples, n_features], in the use case I was thinking, y wouldn't be [n_samples]: - It would be either a scalar and in this case the output of the method would be an array of shape [n_samples] containing p(y|x_i) for all x_i - Or y would be an array of shape [n_values], in which case the output would be [n_samples, n_values]. In other words, we can query the probabilities of all x_i at different values of y. The values must be integers in the case of Poisson regression. BTW, I raised the issue that using predict_proba in a regressor might be problematic, since we (used to) use predict_proba to dectect classifiers. Is it really an issue, now that we have the tag system and is_classifier()? On Thu, Jul 30, 2015 at 9:11 PM, Jan Hendrik Metzen < j...@informatik.uni-bremen.de> wrote: > That's true, I wasn't aware that score_samples is used already in the > context of density estimation. score_samples would be okay then in my > opinion. > > Jan > > On 29.07.2015 18:46, Andreas Mueller wrote: > > Hm, I'm not entirely sure how score_samples is currently used, but I > > think it is the probability > > under a density model. > > It would "only" change the meaning in so far as it is a conditional > > distribution over y given x and not x. > > > > I'm not totally opposed to adding a new method, though I'm not sure I > > like ``predict_proba_at`` > > > > On 07/29/2015 12:29 PM, Jan Hendrik Metzen wrote: > >> I am not sure about the name, score_samples would sound a bit strange > >> for a conditional probability in my opinion. And likelihood is also > >> misleading since its actually a conditional probability and not a > >> conditional likelihood (the quantities on the right-hand side of > >> conditioning are fixed and integrating over all y would be 1). > >> > >> On 29.07.2015 16:16, Andreas Mueller wrote: > >>> Shouldn't that be "score_samples"? > >>> Well, it is a conditional likelihood p(y|x), not p(x) or p(x, y). > >>> But it is the likelihood of some data given the model. > >>> > >>> > >>> On 07/29/2015 02:58 AM, Jan Hendrik Metzen wrote: > >>>> Such a predict_proba_at() method would also make sense for Gaussian > >>>> process regression. Currently, computing probability densities for GPs > >>>> requires predicting mean and standard deviation (via "MSE") at X and > >>>> using scipy.stats.norm.pdf to compute probability densities for y for > >>>> the predicted mean and standard-deviation. I think it would be nice to > >>>> allow this directily via the API. Thus +1 for adding a method like > >>>> predict_proba_at(). > >>>> > >>>> Jan > >>>> > >>>> On 29.07.2015 06:42, Mathieu Blondel wrote: > >>>>> Regarding predictions, I don't really see what's the problem. 
> >>>> On 29.07.2015 06:42, Mathieu Blondel wrote:
> >>>>> Regarding predictions, I don't really see what's the problem. Using
> >>>>> GLMs as an example, you just need to do
> >>>>>
> >>>>>     def predict(self, X):
> >>>>>         if self.loss == "poisson":
> >>>>>             return np.exp(np.dot(X, self.coef_))
> >>>>>         else:
> >>>>>             return np.dot(X, self.coef_)
> >>>>>
> >>>>> A nice thing about Poisson regression is that we can query the
> >>>>> probability p(y|x) for a specific integer y.
> >>>>> https://en.wikipedia.org/wiki/Poisson_regression
> >>>>>
> >>>>> We need to decide an API for that (so far we have used predict_proba
> >>>>> for classification, so the output was always n_samples x n_classes).
> >>>>> How about predict_proba(X, at_y=some_integer)?
> >>>>>
> >>>>> However, this also means that we can't use predict_proba to detect
> >>>>> classifiers anymore...
> >>>>> Another solution would be to introduce a new method
> >>>>> predict_proba_at(X, y=some_integer)...
> >>>>>
> >>>>> Mathieu
> >>>>>
> >>>>> On Wed, Jul 29, 2015 at 4:19 AM, Andreas Mueller
> >>>>> <t3k...@gmail.com> wrote:
> >>>>>
> >>>>>     I was expecting there to be the actual poisson loss implemented
> >>>>>     in the class, not just a log transform.
> >>>>>
> >>>>>     On 07/28/2015 02:03 PM, josef.p...@gmail.com wrote:
> >>>>>>     Just a comment from the statistics sidelines:
> >>>>>>
> >>>>>>     taking the log of the target and fitting a linear or other
> >>>>>>     model doesn't turn it into a Poisson model.
> >>>>>>
> >>>>>>     But maybe "Poisson loss" in machine learning is unrelated to
> >>>>>>     the Poisson distribution, or to a Poisson model with
> >>>>>>     E(y|x) = exp(x beta)?
> >>>>>>
> >>>>>>     Josef
> >>>>>>
> >>>>>>     On Tue, Jul 28, 2015 at 2:46 PM, Andreas Mueller
> >>>>>>     <t3k...@gmail.com> wrote:
> >>>>>>
> >>>>>>         I'd be happy with adding Poisson loss to more models,
> >>>>>>         though I think it would be more natural to first add it to
> >>>>>>         GLM before GBM ;) If the addition is straightforward, I
> >>>>>>         think it would be a nice contribution nevertheless.
> >>>>>>
> >>>>>>         1) For the user to do np.exp(gbmpoisson.predict(X)) is not
> >>>>>>         acceptable. This needs to be automatic. It would be best
> >>>>>>         if this could be done in a minimally intrusive way.
> >>>>>>
> >>>>>>         2) I'm not sure, maybe Peter can comment?
> >>>>>>
> >>>>>>         3) I would rather contribute sooner, but others might
> >>>>>>         think differently. Silently ignoring sample weights is not
> >>>>>>         an option, but you can error if they are provided.
> >>>>>>
> >>>>>>         Hth,
> >>>>>>         Andy
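A minimal sketch of the guard Andy suggests in (3), erroring rather than silently ignoring sample weights; the class, attribute, and message below are illustrative only, not scikit-learn code:

    class PoissonGBMSketch:
        # Hypothetical estimator skeleton, for illustration only.
        def __init__(self, loss="poisson"):
            self.loss = loss

        def fit(self, X, y, sample_weight=None):
            # Refuse, rather than silently ignore, sample weights while
            # they are unsupported for the Poisson loss.
            if sample_weight is not None and self.loss == "poisson":
                raise NotImplementedError(
                    "sample_weight is not yet supported with "
                    "loss='poisson'")
            # ... actual fitting would go here ...
            return self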
> >>>>>> On 07/23/2015 08:52 PM, Peter Rickwood wrote:
> >>>>>>> Hello sklearn developers,
> >>>>>>>
> >>>>>>> I'd like the GBM implementation in sklearn to support Poisson
> >>>>>>> loss, and I'm comfortable writing the code (I have modified my
> >>>>>>> local sklearn source already and am using Poisson loss GBMs).
> >>>>>>>
> >>>>>>> The sklearn site says to get in touch via this list before making
> >>>>>>> a contribution, so is it worth me submitting something along
> >>>>>>> these lines?
> >>>>>>>
> >>>>>>> If the answer is yes, some quick questions:
> >>>>>>>
> >>>>>>> 1) The simplest implementation of Poisson loss GBMs is to work in
> >>>>>>> log-space (i.e. the GBM predicts log(target) rather than target)
> >>>>>>> and require the user to then take the exponential of those
> >>>>>>> predictions. So you would need to do something like:
> >>>>>>>
> >>>>>>>     gbmpoisson = sklearn.ensemble.GradientBoostingRegressor(...)
> >>>>>>>     gbmpoisson.fit(X, y)
> >>>>>>>     preds = np.exp(gbmpoisson.predict(X))
> >>>>>>>
> >>>>>>> I am comfortable making changes to the source for this to work,
> >>>>>>> but I'm not comfortable changing any of the higher-level
> >>>>>>> interface to deal automatically with the transform. In other
> >>>>>>> words, other developers would need to either be OK with the GBM
> >>>>>>> returning transformed predictions in the case where "poisson"
> >>>>>>> loss is chosen, or would need to change code in the 'predict'
> >>>>>>> function to automatically do the transformation if Poisson loss
> >>>>>>> was specified. Is this OK?
> >>>>>>>
> >>>>>>> 2) If I do contribute, can you advise what the best tests are to
> >>>>>>> test/validate GBM loss functions before they are considered to
> >>>>>>> 'work'?
> >>>>>>>
> >>>>>>> 3) Allowing for weighted samples is in theory easy enough to
> >>>>>>> implement, but is not something I have implemented yet. Is it
> >>>>>>> better to contribute code sooner that doesn't handle weighting
> >>>>>>> (i.e. just ignores sample weights), or later code that does?
> >>>>>>>
> >>>>>>> Cheers, and thanks for all your work on sklearn. Fantastic
> >>>>>>> tool/library,
> >>>>>>>
> >>>>>>> Peter
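For reference, a minimal sketch of the "actual poisson loss" being asked for, under the log-space convention Peter describes (the model predicts raw_pred = log(mu), so mu = exp(raw_pred) is the predicted mean E(y|x)); these are illustrative helper functions, not scikit-learn's internal loss API:

    import numpy as np

    def poisson_loss(y, raw_pred):
        # Negative Poisson log-likelihood, up to the constant log(y!):
        # mu - y * log(mu), with mu = exp(raw_pred).
        return np.mean(np.exp(raw_pred) - y * raw_pred)

    def poisson_negative_gradient(y, raw_pred):
        # Pseudo-residuals each boosting stage would fit:
        # -d(loss)/d(raw_pred) = y - exp(raw_pred) = y - mu
        return y - np.exp(raw_pred)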