2013/7/8 Skipper Seabold <jsseab...@gmail.com>:
> On Mon, Jul 8, 2013 at 1:20 PM, Bertrand Thirion
> <bertrand.thir...@inria.fr> wrote:
>>
>> From: "Jacob Vanderplas" <jake...@cs.washington.edu>
>> To: scikit-learn-general@lists.sourceforge.net
>> Sent: Sunday, July 7, 2013 19:10:38
>> Subject: [Scikit-learn-general] Defining a Density Estimation Interface
>>
>> Hi,
>> I've been working on a big rewrite of the Ball Tree and KD Tree in
>> sklearn.neighbors [0], and one of the enhancements is a fast Kernel
>> Density estimation routine. As part of the PR, I've created a
>> KernelDensity class to wrap this functionality. For the initial pass at
>> the interface, I've used the same method names used in
>> sklearn.mixture.GMM, which (I believe) is the only other density
>> estimation routine we currently have. In particular, I've defined these
>> methods:
>>
>> - fit(X) -- fit the model
>> - eval(X) -- compute the log-probability (i.e. normalized density) under
>>   the model at positions X
>> - score(X) -- compute the log-likelihood of a set of data X under the
>>   model
>> - sample(n_samples) -- draw random samples from the underlying density
>>   model
>>
>> Olivier suggested that perhaps ``eval`` is too generic a name, and
>> should instead be something more specific (logprobability?
>> loglikelihood? predict_loglikelihood? something else?)
>>
>> Sounds good to me. As a matter of taste, I like `log_likelihood`, which
>> would be a synonym of `eval` in that case (as a second choice,
>> log_density rather than log_probability)?
>
> Why not conform to the already existing distributions interface in
> scipy.stats? That's what we did with statsmodels. These are mostly
> univariate distributions in scipy, but I think it generalized ok to
> the multivariate density estimators and kernel regression models we
> now have.
>
> http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html#scipy.stats.rv_continuous
> https://github.com/statsmodels/statsmodels/tree/master/statsmodels/nonparametric
> http://statsmodels.sourceforge.net/devel/nonparametric.html
>
> Then you'd have pdf, logpdf, cdf, logcdf, sf, rvs (not wild about this
> one, and I think we use sample in places), etc.
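For concreteness, the scipy.stats interface referenced above looks like
this in use. A quick sketch with the standard normal distribution (the
numeric comments are the standard closed-form values):

```python
from scipy.stats import norm

# Density and log-density at x = 0.
print(norm.pdf(0.0))     # 1/sqrt(2*pi) ~= 0.3989
print(norm.logpdf(0.0))  # -0.5*log(2*pi) ~= -0.9189

# Cumulative distribution function and its log.
print(norm.cdf(0.0))     # 0.5
print(norm.logcdf(0.0))  # log(0.5) ~= -0.6931

# sf is the survival function, 1 - cdf.
print(norm.sf(0.0))      # 0.5

# rvs draws random variates from the distribution.
samples = norm.rvs(size=3)
print(samples.shape)     # (3,)
```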
I am not fond of acronyms, especially when they are not common at all, as
with `rvs`. I think `rvs` stands for Random Variable Samples, but it's
documented nowhere in the SciPy documentation. `sample` is a much more
descriptive and intuitive method name (explicit is better than implicit).
`pdf`, `logpdf`, `cdf` and `logcdf` are OK-ish names, as those acronyms
are very common, but `density`, `log_density`, `cumulative_density` and
`log_cumulative_density` are even more explicit, hence more user-friendly
IMHO. I am not sure what `sf` stands for, so it's probably a poor choice:
we should not assume that the library's users are well versed in stats
acronyms. scikit-learn is often used by people new to stats and machine
learning, so we should be careful to pick explicit names.

> Would it break too much the Pipeline interface in scikit-learn? If
> not, I prefer rather to call things what they are. In any event, I
> agree that eval is too generic and I'd add that the score function of
> a distribution has a specific meaning already for parameterized
> distributions.

The Pipeline cannot use density estimators so far. It uses a generic
"transform" method on the intermediate steps (typically to perform
feature extraction, selection, projections or non-linear embeddings).

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
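[Editor's note: to make the explicit-naming proposal concrete, here is a
minimal pure-Python sketch of a 1-D Gaussian kernel density estimator
using the names discussed above (`fit`, `density`, `log_density`,
`sample`). The class and its internals are hypothetical illustrations,
not the actual code from the PR under discussion.]

```python
import math
import random

class ToyKernelDensity:
    """Minimal 1-D Gaussian KDE, illustrating explicit method names."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth
        self.data_ = []

    def fit(self, X):
        """Memorize the training points (KDE is non-parametric)."""
        self.data_ = list(X)
        return self

    def density(self, x):
        """Kernel density estimate at point x (explicit name for pdf)."""
        h, n = self.bandwidth, len(self.data_)
        norm = 1.0 / (n * h * math.sqrt(2 * math.pi))
        return norm * sum(math.exp(-((x - xi) ** 2) / (2 * h * h))
                          for xi in self.data_)

    def log_density(self, x):
        """Explicit name for logpdf."""
        return math.log(self.density(x))

    def sample(self, n_samples, rng=random):
        """Draw from the mixture: pick a training point, add kernel noise."""
        return [rng.choice(self.data_) + rng.gauss(0.0, self.bandwidth)
                for _ in range(n_samples)]

kde = ToyKernelDensity(bandwidth=0.5).fit([-1.0, 0.0, 1.0])
print(kde.density(0.0) > 0.0)  # True
print(len(kde.sample(5)))      # 5
```

A real implementation would of course evaluate on arrays of points and
use a tree structure for speed; this only shows how the explicitly named
interface would read to a user.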