2013/7/8 Skipper Seabold <jsseab...@gmail.com>:
> On Mon, Jul 8, 2013 at 1:20 PM, Bertrand Thirion
> <bertrand.thir...@inria.fr> wrote:
>>
>> From: "Jacob Vanderplas" <jake...@cs.washington.edu>
>> To: scikit-learn-general@lists.sourceforge.net
>> Sent: Sunday, July 7, 2013 19:10:38
>> Subject: [Scikit-learn-general] Defining a Density Estimation Interface
>>
>>
>> Hi,
>> I've been working on a big rewrite of the Ball Tree and KD Tree in 
>> sklearn.neighbors [0], and one of the enhancements is a fast Kernel Density 
>> estimation routine.  As part of the PR, I've created a KernelDensity class 
>> to wrap this functionality.  For the initial pass at the interface, I've 
>> used the same method names used in sklearn.mixture.GMM, which (I believe) is 
>> the only other density estimation routine we currently have.  In particular, 
>> I've defined these methods:
>>
>> - fit(X) -- fit the model
>> - eval(X) -- compute the log-probability (i.e. normalized density) under the 
>> model at positions X
>> - score(X) -- compute the log-likelihood of a set of data X under the model
>> - sample(n_samples) -- draw random samples from the underlying density model
>>
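[For concreteness, the proposed interface can be sketched as a toy pure-Python 1-D Gaussian KDE. The class name and implementation here are illustrative only, not the BallTree-based code from the PR:]

```python
import math
import random


class KernelDensitySketch:
    """Toy 1-D Gaussian KDE illustrating the proposed fit/eval/score/sample API."""

    def __init__(self, bandwidth=1.0):
        self.bandwidth = bandwidth

    def fit(self, X):
        # "Fitting" a KDE just means remembering the training points.
        self.X_ = list(X)
        return self

    def eval(self, X):
        # Log-density (normalized) of each query point under the model.
        h, n = self.bandwidth, len(self.X_)
        norm = h * math.sqrt(2 * math.pi)
        out = []
        for x in X:
            dens = sum(
                math.exp(-0.5 * ((x - xi) / h) ** 2) / norm for xi in self.X_
            ) / n
            out.append(math.log(dens))
        return out

    def score(self, X):
        # Total log-likelihood of the data X under the model.
        return sum(self.eval(X))

    def sample(self, n_samples, rng=random):
        # Pick a training point uniformly, then add Gaussian kernel noise.
        return [rng.choice(self.X_) + rng.gauss(0.0, self.bandwidth)
                for _ in range(n_samples)]
```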
>> Olivier suggested that perhaps ``eval`` is too generic a name, and should 
>> instead be something more specific (logprobability? loglikelihood? 
>> predict_loglikelihood? something else?)
>>
>> Sounds good to me. As a matter of taste, I like `log_likelihood`, which
>> would be a synonym of `eval` in that case (or, as a second choice,
>> `log_density` rather than `log_probability`)?
>
> Why not conform to the already existing distributions interface in
> scipy.stats? That's what we did with statsmodels. These are mostly
> univariate distributions in scipy, but I think it generalized ok to
> the multivariate density estimators and kernel regression models we
> now have.
>
> http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html#scipy.stats.rv_continuous
> https://github.com/statsmodels/statsmodels/tree/master/statsmodels/nonparametric
> http://statsmodels.sourceforge.net/devel/nonparametric.html
>
> Then you'd have pdf, logpdf, cdf, logcdf, sf, rvs (not wild about this
> one, and I think we use sample in places), etc.
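[To make those names concrete, here is a stdlib-only sketch of what each method computes for a standard normal distribution. This is an illustration of the naming scheme, not scipy's implementation:]

```python
import math
import random


def pdf(x):
    # Probability density function of the standard normal.
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def logpdf(x):
    # Log of the probability density function.
    return math.log(pdf(x))


def cdf(x):
    # Cumulative distribution function, via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def logcdf(x):
    # Log of the cumulative distribution function.
    return math.log(cdf(x))


def rvs(n, rng=random):
    # Draw n random samples from the distribution.
    return [rng.gauss(0.0, 1.0) for _ in range(n)]
```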

I am not fond of acronyms, especially when they are not common at all,
such as `rvs`. I think `rvs` stands for "random variable samples", but
it is documented nowhere in the SciPy documentation. `sample` is a much
more descriptive and intuitive method name (explicit is better than
implicit).

pdf, logpdf, cdf, logcdf are OK-ish names, as those acronyms are very
common. But `density`, `log_density`, `cumulative_density` and
`log_cumulative_density` are even more explicit, hence more
user-friendly, IMHO.

I am not sure what `sf` stands for, so it's probably a poor choice: we
should not assume that the library's users will be well versed in stats
acronyms.

scikit-learn is often used by people who are new to stats and machine
learning, so we should be careful to pick explicit names.

> Would it break the Pipeline interface in scikit-learn too much? If
> not, I would rather call things what they are. In any event, I agree
> that eval is too generic, and I'd add that the score function of a
> distribution already has a specific meaning for parameterized
> distributions.

The Pipeline cannot use density estimators so far. It uses a generic
`transform` method on the intermediate steps (typically to perform
feature extraction, selection, projections or non-linear embeddings).
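[A toy sketch of that chaining logic (hypothetical class names, not scikit-learn's actual Pipeline) shows why a pure density estimator cannot sit in the middle of a pipeline: each intermediate step must hand its transform output to the next step, and a density model has nothing to pass on:]

```python
class Standardize:
    """Intermediate step: must provide fit and transform."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class MeanScore:
    """Final step: only fit and score are required, no transform."""

    def fit(self, X):
        self.fitted_ = list(X)
        return self

    def score(self, X):
        # Toy log-likelihood-like score: penalize distance from zero.
        return -sum(x * x for x in X)


class ToyPipeline:
    """Minimal sketch of Pipeline chaining via transform."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X):
        # Intermediate steps are chained through their transform output.
        for step in self.steps[:-1]:
            X = step.fit(X).transform(X)
        self.steps[-1].fit(X)
        return self

    def score(self, X):
        for step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1].score(X)
```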

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
