On Mon, Jul 8, 2013 at 1:20 PM, Bertrand Thirion
<bertrand.thir...@inria.fr> wrote:
>
> De: "Jacob Vanderplas" <jake...@cs.washington.edu>
> À: scikit-learn-general@lists.sourceforge.net
> Envoyé: Dimanche 7 Juillet 2013 19:10:38
> Objet: [Scikit-learn-general] Defining a Density Estimation Interface
>
>
> Hi,
> I've been working on a big rewrite of the Ball Tree and KD Tree in 
> sklearn.neighbors [0], and one of the enhancements is a fast Kernel Density 
> estimation routine.  As part of the PR, I've created a KernelDensity class to 
> wrap this functionality.  For the initial pass at the interface, I've used 
> the same method names used in sklearn.mixture.GMM, which (I believe) is the 
> only other density estimation routine we currently have.  In particular, I've 
> defined these methods:
>
> - fit(X) -- fit the model
> - eval(X) -- compute the log-probability (i.e. normalized density) under the 
> model at positions X
> - score(X) -- compute the log-likelihood of a set of data X under the model
> - sample(n_samples) -- draw random samples from the underlying density model
>
> Olivier suggested that perhaps ``eval`` is too generic a name, and should 
> instead be something more specific (logprobability? loglikelihood? 
> predict_loglikelihood? something else?)
>
> Sounds good to me. As a matter of taste, I like `log_likelihood`, which
> would be a synonym for `eval` in that case (with `log_density` rather than
> `log_probability` as a second choice).
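
[To make the proposal quoted above concrete, here is a minimal sketch of an
estimator with those four methods. The class body and the Gaussian-kernel
math are illustrative assumptions, not the PR's actual code.]

import numpy as np
from scipy.special import logsumexp

class KernelDensity:
    # Sketch of the proposed interface (fit / eval / score / sample).

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth

    def fit(self, X):
        # Store the training data; the real PR builds a ball tree or
        # KD-tree here so that queries are fast.
        self.X_ = np.asarray(X, dtype=float)
        return self

    def eval(self, X):
        # Log of the normalized density at each query point, using a
        # Gaussian kernel of width `bandwidth` in d dimensions.
        X = np.asarray(X, dtype=float)
        n, d = self.X_.shape
        h = self.bandwidth
        sq_dists = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(-1)
        log_norm = -np.log(n) - d * np.log(h) - 0.5 * d * np.log(2 * np.pi)
        return log_norm + logsumexp(-0.5 * sq_dists / h ** 2, axis=1)

    def score(self, X):
        # Total log-likelihood of the data X under the fitted model.
        return self.eval(X).sum()

    def sample(self, n_samples=1, random_state=None):
        # Draw training points at random, then jitter each by the kernel.
        rng = np.random.default_rng(random_state)
        idx = rng.integers(len(self.X_), size=n_samples)
        noise = rng.normal(scale=self.bandwidth,
                           size=(n_samples, self.X_.shape[1]))
        return self.X_[idx] + noise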

Why not conform to the existing distributions interface in
scipy.stats? That's what we did in statsmodels. scipy mostly has
univariate distributions, but I think the interface generalized fine
to the multivariate density estimators and kernel regression models
we now have.

http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html#scipy.stats.rv_continuous
https://github.com/statsmodels/statsmodels/tree/master/statsmodels/nonparametric
http://statsmodels.sourceforge.net/devel/nonparametric.html

Then you'd have pdf, logpdf, cdf, logcdf, sf, rvs (I'm not wild about
that last name, and I think we use sample in places), etc.
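
[For instance, with a frozen normal distribution; any rv_continuous
exposes the same methods, and the values here are just placeholders.]

import numpy as np
from scipy import stats

rv = stats.norm(loc=0.0, scale=1.0)   # a frozen rv_continuous
x = np.array([-1.0, 0.0, 2.5])

rv.pdf(x)      # density at x
rv.logpdf(x)   # log-density (the proposed eval / log_likelihood)
rv.cdf(x)      # cumulative distribution function
rv.logcdf(x)   # log of the CDF
rv.sf(x)       # survival function, 1 - cdf(x)
rv.rvs(size=5, random_state=0)   # random variates (sklearn's sample)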

Would it break the Pipeline interface in scikit-learn too much? If
not, I'd rather call things what they are. In any event, I agree that
eval is too generic, and I'd add that the score function of a
distribution already has a specific meaning for parameterized
distributions: the gradient of the log-likelihood with respect to the
parameters.
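
[To make that naming clash concrete, a small sketch for a univariate
Gaussian, with hypothetical helper names for illustration only: the
statistical score is a per-observation gradient in the parameters,
while the proposed score(X) is a single log-likelihood value.]

import numpy as np

def statistical_score(x, mu, sigma):
    # Score in the statistics sense: the gradient of
    # log N(x; mu, sigma^2) with respect to mu, i.e. (x - mu) / sigma^2.
    return (x - mu) / sigma ** 2

def estimator_style_score(X, mu, sigma):
    # "score" in the sense proposed for the estimator interface:
    # total log-likelihood of the data under fixed parameters.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - (X - mu) ** 2 / (2 * sigma ** 2))

X = np.array([0.5, -1.2, 0.3])
statistical_score(X, mu=0.0, sigma=1.0)      # array: one gradient per point
estimator_style_score(X, mu=0.0, sigma=1.0)  # scalar log-likelihood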

fwiw,

Skipper
