Re: [Scikit-learn-general] Question about KernelDensity implementation

Michael Eickenberg Wed, 05 Nov 2014 07:41:09 -0800

On Wed, Nov 5, 2014 at 1:52 PM, Kyle Kastner <kastnerk...@gmail.com> wrote:


> In addition to the y=None thing, KDE doesn't have a transform or predict
> method - and I don't think Pipeline supports score or score_samples.
>

That may have been the crucial thing I missed :) Indeed, KDE would have to
be the last step of the pipeline, because it doesn't do any transforming.
One can imagine it preceded by a scaler, as in José's example, or by e.g.
PCA. Pipeline does implement scoring directly in
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py#L193
which applies all the preceding transformations and then calls score on the
last step, so that should be OK.
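
As a rough illustration of that pass-through (a sketch, not taken from pipeline.py itself): the data, the bandwidth value and the variable names below are made up, and it assumes a scikit-learn version in which KernelDensity.fit accepts the y argument that Pipeline.fit passes along.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up 2-D data with very different scales per column.
rng = np.random.RandomState(0)
X = rng.normal(loc=[0.0, 10.0], scale=[1.0, 5.0], size=(100, 2))

# KDE is the last step; the scaler in front of it does the transforming.
pipe = make_pipeline(StandardScaler(), KernelDensity(bandwidth=0.5))
pipe.fit(X)

# Pipeline.score transforms X with every preceding step, then calls
# score on the final estimator (total log-likelihood for KernelDensity).
pipeline_score = pipe.score(X)

# The same thing done by hand with the fitted steps.
scaler = pipe.named_steps["standardscaler"]
kde = pipe.named_steps["kerneldensity"]
manual_score = kde.score(scaler.transform(X))
```

If the pass-through works as described, pipeline_score and manual_score should agree up to floating-point error.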


> Maybe someone can comment on this, but I don't think KDE is typically used
> in a pipeline.
>
> In this particular case the code *seems* reasonable (and I am surprised it
> doesn't work!), but I don't know much about the KDE stuff. Maybe a bug?
>
> On Wed, Nov 5, 2014 at 7:44 AM, Michael Eickenberg <
> michael.eickenb...@gmail.com> wrote:
>
>> Hi José,
>>
>> Yes, there seems to be an inconsistency: KernelDensity.fit has the
>> signature (self, X) and not (self, X, y=None), which is the usual
>> convention even if y is never used; see
>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/neighbors/kde.py#L113
>>
>> I think the generally accepted way of remedying this is to just add
>> y=None in the signature of that function, as was done e.g. for PCA, see
>> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/decomposition/pca.py#L206
>>
>> But maybe I am missing something crucial. Happy to make the PR if I am
>> right about this.
>>
>> Michael
>>
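
A minimal sketch of why the missing y matters. The two classes and the helper below are hypothetical stand-ins, not scikit-learn code; the helper only mimics the positional call self.steps[-1][-1].fit(Xt, y) that appears in the traceback.

```python
import numpy as np


class FitWithoutY:
    """Hypothetical estimator whose fit takes only X."""

    def fit(self, X):
        self.mean_ = np.asarray(X).mean(axis=0)
        return self


class FitWithY:
    """Same estimator, but fit accepts (and ignores) y."""

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X).mean(axis=0)
        return self


def fit_last_step(estimator, Xt, y=None):
    """Mimic how the pipeline calls its final step: y is always passed."""
    return estimator.fit(Xt, y)


X = np.arange(10.0).reshape(5, 2)

# fit(self, X) cannot absorb the extra positional argument.
try:
    fit_last_step(FitWithoutY(), X)
    strict_failed = False
except TypeError:
    strict_failed = True

# fit(self, X, y=None) accepts y=None and simply ignores it.
fitted = fit_last_step(FitWithY(), X)
```

Adding y=None changes nothing for callers that pass only X, which is presumably why it is the usual remedy.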
>> On Wed, Nov 5, 2014 at 1:35 PM, José Guilherme Camargo de Souza <
>> jose.camargo.so...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Is the KernelDensity estimator compatible with pipelines? When I try
>>> to use it inside one
>>>
>>>     pipe1 = make_pipeline(StandardScaler(with_mean=True, with_std=True),
>>>                           KernelDensity(algorithm="auto", kernel="gaussian",
>>>                                         metric="euclidean"))
>>>     params = dict(kerneldensity__bandwidth=np.logspace(-10, 1, 100))
>>>     search = GridSearchCV(pipe1, param_grid=params, verbose=1, n_jobs=8, cv=5)
>>>     search.fit(feats1)
>>>     search.best_estimator_
>>>
>>> I get a TypeError as follows:
>>>
>>> /home/desouza/anaconda/lib/python2.7/site-packages/sklearn/pipeline.pyc
>>> in fit(self=Pipeline(steps=[('standardscaler',
>>> StandardScale...euclidean',
>>>        metric_params=None, rtol=0))]), X=array([[  5.701     ,
>>> 73.6443    ,  61.7018    ...2.7188    ,
>>>           0.18243243,   0.21621622]]), y=None, **fit_params={})
>>>     125     def fit(self, X, y=None, **fit_params):
>>>     126         """Fit all the transforms one after the other and transform the
>>>     127         data, then fit the transformed data using the final estimator.
>>>     128         """
>>>     129         Xt, fit_params = self._pre_transform(X, y, **fit_params)
>>> --> 130         self.steps[-1][-1].fit(Xt, y, **fit_params)
>>>     131         return self
>>>     132
>>>     133     def fit_transform(self, X, y=None, **fit_params):
>>>     134         """Fit all the transforms one after the other and transform the
>>>
>>> TypeError: fit() takes exactly 2 arguments (3 given)
>>>
>>> Is this a bug, or is KernelDensity not supposed to be compatible with
>>> pipelines? A quick search of the mailing list and of Stack Overflow did
>>> not return any entry about this.
>>>
>>> Thanks,
>>> José
>>>
>>>
>>> On Tue, Oct 21, 2014 at 3:03 PM, Jacob Vanderplas
>>> <jake...@cs.washington.edu> wrote:
>>> > Hi Jose,
>>> > The KDE implementation does work on multivariate data, and will in general
>>> > work for multimodal data as well. There are two caveats to that:
>>> >
>>> > 1. In the sklearn implementation, the bandwidth must be the same across each
>>> > dimension. If this poses a problem for your data, the data can be scaled
>>> > before the fit (using StandardScaler or something similar).
>>> > 2. The results will depend strongly on the choice of bandwidth: it's
>>> > important to cross-validate to determine the optimal bandwidth, as is done in
>>> > http://scikit-learn.org/stable/auto_examples/neighbors/plot_digits_kde_sampling.html
>>> >
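
The two caveats above can be sketched together as follows. This is only an illustration under stated assumptions: the data, the bandwidth grid and the variable names are made up, and it assumes a scikit-learn version that provides sklearn.model_selection and in which KernelDensity works as the last step of a Pipeline.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Made-up 3-D data with three Gaussian modes and very different scales per axis.
centers = np.array([[0.0, 0.0, 0.0], [5.0, 50.0, 0.5], [-5.0, 100.0, 1.0]])
X = np.vstack([c + rng.normal(size=(60, 3)) * [1.0, 10.0, 0.1] for c in centers])
rng.shuffle(X)  # shuffle rows so every CV fold sees all three modes

# Caveat 1: a single bandwidth is shared by all dimensions, so scale first.
pipe = make_pipeline(StandardScaler(), KernelDensity(kernel="gaussian"))

# Caveat 2: pick the bandwidth by cross-validation. With no scoring argument,
# GridSearchCV uses the pipeline's score, i.e. the held-out log-likelihood.
bandwidths = np.logspace(-1, 1, 10)
search = GridSearchCV(pipe, {"kerneldensity__bandwidth": bandwidths}, cv=5)
search.fit(X)

best_bandwidth = search.best_params_["kerneldensity__bandwidth"]
```

Keeping the scaler inside the pipeline means it is refit on each training fold, so the held-out fold does not influence the scaling.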
>>> > Good luck!
>>> >   Jake
>>> >
>>> >
>>> >  Jake VanderPlas
>>> >  Director of Research – Physical Sciences
>>> >  eScience Institute, University of Washington
>>> >  http://www.vanderplas.com
>>> >
>>> > On Tue, Oct 21, 2014 at 2:09 AM, José Guilherme Camargo de Souza
>>> > <jose.camargo.so...@gmail.com> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I would like to ask if the density estimation implementation of scikit
>>> >> works with multivariate multimodal data. In the digits example [1] it
>>> >> is clear that it supports multivariate datasets and in the guide
>>> >> description [2] a 1-D bimodal distribution is used.
>>> >>
>>> >> Is it possible to use the same implementation on multivariate
>>> >> gaussian-shaped data with more than 2 modes? If so, are there any
>>> >> shortcomings or useful tips when doing that?
>>> >>
>>> >> Thanks in advance,
>>> >> José
>>> >>
>>> >> [1] http://scikit-learn.org/stable/auto_examples/neighbors/plot_digits_kde_sampling.html#example-neighbors-plot-digits-kde-sampling-py
>>> >> [2] http://scikit-learn.org/stable/modules/density.html#kernel-density-estimation
>>> >> José Guilherme
>>> >>
>>> >>
>>> >>
>>> >> _______________________________________________
>>> >> Scikit-learn-general mailing list
>>> >> Scikit-learn-general@lists.sourceforge.net
>>> >> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general