On Mon, Mar 26, 2012 at 12:02 PM, Vincent Dubourg <[email protected]> wrote:
> Hi,
>
> As the author of the quoted sentence from the documentation, I must say
> that it is a bit personal with respect to my own experience and goals, and
> it needs to be corrected with more objective facts (e.g. about the
> computational complexity, as Mathieu mentioned). By "efficiency", I meant
> predictive power in terms of score... But this much is a fact: regression
> in high-dimensional spaces is hard anyway, or else it requires many samples.
>
> I agree that the computational complexity actually suffers more from
> n_samples than from n_features. That makes sense. However, I think that
> for anisotropic kernels (i.e. componentwise tensor product kernels),
> n_features does have an influence on the prediction time, doesn't it?
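>
> To make this concrete, here is a minimal NumPy sketch (my own naming, not
> scikit-learn's API) of a componentwise anisotropic squared-exponential
> kernel: there is one hyperparameter per feature, so every kernel
> evaluation at prediction time costs O(n_features):
>
> import numpy as np
>
> def anisotropic_rbf(X, Y, theta):
>     # theta: one weight (inverse squared length scale) per feature;
>     # componentwise weighted squared distances, shape (n_x, n_y)
>     d2 = ((X[:, None, :] - Y[None, :, :]) ** 2 * theta).sum(axis=-1)
>     return np.exp(-d2)
>
> X_train = np.random.rand(50, 10)   # 50 samples, 10 features
> x_new = np.random.rand(1, 10)
> theta = np.ones(10)                # one hyperparameter per feature
> k_star = anisotropic_rbf(x_new, X_train, theta)  # cost grows with n_features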
>
> So, Tao, do not hesitate to use GPML for high-dimensional problems!... But
> do not expect better (or worse) performance than Support Vector
> Regression... IMHO, with a good fitting technique for both predictors you
> can achieve similar scores... except that GPML's kernels have this
> anisotropy feature, which may make the difference on your data!? Tell us!
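>
> If you want to run that comparison yourself, something like the following
> works. This is only a sketch with made-up data, written against a recent
> scikit-learn where the GP estimator is GaussianProcessRegressor (in the
> version current at the time of this thread, the class was
> sklearn.gaussian_process.GaussianProcess):
>
> import numpy as np
> from sklearn.svm import SVR
> from sklearn.gaussian_process import GaussianProcessRegressor
> from sklearn.model_selection import cross_val_score
>
> rng = np.random.RandomState(0)
> X = rng.rand(200, 5)
> y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.randn(200)
>
> for name, model in [("SVR", SVR(C=10.0)),
>                     ("GP", GaussianProcessRegressor(alpha=1e-2))]:
>     scores = cross_val_score(model, X, y, cv=5)  # R^2 by default
>     print(name, scores.mean())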
>
> Cheers,
> Vincent
>
> 2012/3/26 Tao-wei Huang <[email protected]>
>
>> Hi Mathieu,
>>
>> Thank you for your reply. If it's expensive in terms of sample size, that
>> totally makes sense to me. However, I am still confused by this statement
>> in the scikit-learn documentation:
>>
>> "It loses efficiency in high dimensional spaces – namely when the number
>> of features exceeds a few dozens. It might indeed give poor performance and
>> it loses computational efficiency."
>> http://scikit-learn.org/stable/modules/gaussian_process.html
>>
>> Even if 'the number of features' here actually refers to the sample size,
>> I don't think the model would become inefficient with just a few dozen
>> samples. Could you or anyone else clarify this for me, please? Thanks!
>>
>> Cheers,
>> Tao
>>
>>
>>
>>
>> On Mon, Mar 26, 2012 at 10:24 AM, Mathieu Blondel <[email protected]> wrote:
>>
>>> If I'm not mistaken, Gaussian Processes are expensive for large
>>> n_samples, not for large n_features. The reason is that the kernel
>>> matrix (called the covariance matrix in the GP literature) needs to be
>>> inverted, which takes O(n_samples^3) time with a Cholesky decomposition.
>>> That said, kernel methods like SVMs or Gaussian Processes are usually
>>> not used much with high-dimensional data. Kernels are useful for
>>> implicitly projecting low-dimensional data into higher (even infinite)
>>> dimensional spaces. If your data is already high-dimensional, there's
>>> nothing to gain from using kernels. A good example is text
>>> classification, where everyone uses linear kernels.
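>>>
>>> To see where the O(n_samples^3) comes from, here is a bare-bones GP
>>> regression sketch in NumPy/SciPy (not scikit-learn's implementation):
>>> the one expensive step is the Cholesky factorization of the
>>> n_samples x n_samples covariance matrix.
>>>
>>> import numpy as np
>>> from scipy.linalg import cho_factor, cho_solve
>>>
>>> def rbf(X, Y, gamma=10.0):
>>>     d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
>>>     return np.exp(-gamma * d2)
>>>
>>> X = np.random.rand(500, 3)               # n_samples = 500
>>> y = np.sin(4 * X[:, 0])
>>>
>>> K = rbf(X, X) + 1e-6 * np.eye(len(X))    # covariance matrix + nugget
>>> c, low = cho_factor(K)                   # O(n_samples^3) bottleneck
>>> alpha = cho_solve((c, low), y)           # O(n_samples^2)
>>>
>>> X_new = np.random.rand(10, 3)
>>> y_pred = rbf(X_new, X).dot(alpha)        # predictive mean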
>>>
>>> HTH,
>>> Mathieu
>>>
>
>
Hi,
I have experience using PLS for high-dimensional regression (which is now
part of scikit-learn) with relatively few observations, and my results
have been promising. I've also written a PLS algorithm that uses pandas,
which I have used to solve several problems in my domain (examining the
effects of weather on crop disease and crop yield). PLS has been used a lot
in chemometrics, as well as for analyzing DNA microarray data (which is
very high-dimensional with very few observations), and in some applications
in neuroscience. If you're interested, I can try to dig up some resources.
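
As a quick illustration, here's a toy sketch with made-up data. It uses
the PLSRegression class, which in current scikit-learn lives in
sklearn.cross_decomposition:

import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.RandomState(0)
X = rng.randn(30, 500)               # n_features >> n_samples, as in microarray data
y = X[:, :5].sum(axis=1) + 0.1 * rng.randn(30)

pls = PLSRegression(n_components=3)  # regress via a few latent components
pls.fit(X, y)
print(pls.score(X, y))               # R^2 on the training data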
Cheers,
Aman