On Mon, Jul 8, 2013 at 1:20 PM, Bertrand Thirion <bertrand.thir...@inria.fr> wrote:

> From: "Jacob Vanderplas" <jake...@cs.washington.edu>
> To: scikit-learn-general@lists.sourceforge.net
> Sent: Sunday, July 7, 2013 19:10:38
> Subject: [Scikit-learn-general] Defining a Density Estimation Interface
>
> Hi,
> I've been working on a big rewrite of the Ball Tree and KD Tree in
> sklearn.neighbors [0], and one of the enhancements is a fast kernel density
> estimation routine. As part of the PR, I've created a KernelDensity class to
> wrap this functionality. For the initial pass at the interface, I've used
> the same method names as sklearn.mixture.GMM, which (I believe) is the
> only other density estimation routine we currently have. In particular, I've
> defined these methods:
>
> - fit(X) -- fit the model
> - eval(X) -- compute the log-probability (i.e. normalized density) under
>   the model at positions X
> - score(X) -- compute the log-likelihood of a set of data X under the model
> - sample(n_samples) -- draw random samples from the underlying density model
>
> Olivier suggested that perhaps ``eval`` is too generic a name, and should
> instead be something more specific (logprobability? loglikelihood?
> predict_loglikelihood? something else?)

Sounds good to me. As a matter of taste, I like `log_likelihood`, which would
be a synonym for `eval` in that case (as a second choice, `log_density` rather
than `log_probability`).
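[Editorial note: the four proposed methods can be sketched as a toy Gaussian-kernel density estimator. `KernelDensitySketch` is a hypothetical name for illustration only; the real PR's class is backed by the Ball Tree / KD Tree and its final method names were exactly what this thread debates.]

```python
import numpy as np

class KernelDensitySketch:
    """Toy illustration of the proposed fit / eval / score / sample interface,
    using a Gaussian kernel in plain NumPy (not the actual PR code)."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth

    def fit(self, X):
        # A KDE "fit" just stores the training points (plus, in the real PR,
        # a tree structure for fast queries).
        self.X_ = np.asarray(X, dtype=float)
        return self

    def eval(self, X):
        # Log-density at each query point -- the method whose name is debated.
        X = np.asarray(X, dtype=float)
        n, d = self.X_.shape
        sq = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(-1)
        log_kernel = -sq / (2 * self.bandwidth ** 2)
        log_norm = -0.5 * d * np.log(2 * np.pi * self.bandwidth ** 2) - np.log(n)
        # log-sum-exp over training points for numerical stability
        m = log_kernel.max(axis=1, keepdims=True)
        return m[:, 0] + np.log(np.exp(log_kernel - m).sum(axis=1)) + log_norm

    def score(self, X):
        # Total log-likelihood of a dataset: sum of per-point log-densities.
        return self.eval(X).sum()

    def sample(self, n_samples, rng=None):
        # Sampling from a Gaussian KDE: pick a training point uniformly,
        # then add Gaussian noise with scale = bandwidth.
        rng = np.random.default_rng(rng)
        idx = rng.integers(len(self.X_), size=n_samples)
        noise = rng.normal(scale=self.bandwidth,
                           size=(n_samples, self.X_.shape[1]))
        return self.X_[idx] + noise
```

With a single training point and unit bandwidth, `eval` reduces to the log of a standard normal density, which makes the semantics of "normalized log-density" concrete.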
Why not conform to the already-existing distributions interface in scipy.stats? That's what we did with statsmodels. These are mostly univariate distributions in scipy, but I think it generalized OK to the multivariate density estimators and kernel regression models we now have.

http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html#scipy.stats.rv_continuous
https://github.com/statsmodels/statsmodels/tree/master/statsmodels/nonparametric
http://statsmodels.sourceforge.net/devel/nonparametric.html

Then you'd have pdf, logpdf, cdf, logcdf, sf, rvs (not wild about this one, and I think we use sample in places), etc. Would it break the Pipeline interface in scikit-learn too much? If not, I'd rather call things what they are.

In any event, I agree that eval is too generic, and I'd add that the score function of a distribution already has a specific meaning for parameterized distributions.

fwiw,
Skipper
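[Editorial note: the scipy.stats naming convention Skipper refers to, shown on a standard normal distribution. This only illustrates the method names (pdf, logpdf, cdf, sf, rvs), not kernel density estimation itself.]

```python
import numpy as np
from scipy.stats import norm

x = 0.0
density = norm.pdf(x)         # probability density function
log_density = norm.logpdf(x)  # log of the pdf
cum = norm.cdf(x)             # cumulative distribution function
surv = norm.sf(x)             # survival function, 1 - cdf
draws = norm.rvs(size=3, random_state=0)  # random variates, i.e. "sample"
```

Under this convention the debated `eval` would become `logpdf`, `score` would keep its statistical meaning, and `sample` would become `rvs`.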