Hi,

I feel the same as Jake.

Hadayat, I looked at your code, and there is a fair bit of refactoring
to be done to fit the scikit-learn API.

I would encourage you to do this refactoring and try adapting
the current GP examples to see how it compares in terms of
speed, results and code readability.
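
To make "fit with the scikit-learn API" concrete, here is a rough sketch of
the shape I have in mind. Everything in it is an assumption for illustration:
the class name SketchGPRegressor, the fixed RBF kernel, and the parameter
names are made up and are not taken from PyGPML or from sklearn. The points
that matter are that __init__ only stores hyperparameters, fit(X, y) returns
self, learned state carries a trailing underscore, and predict(X) returns an
array. Real code would also subclass sklearn.base.BaseEstimator and
RegressorMixin.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance between the rows of A and B."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale ** 2)


class SketchGPRegressor:
    """Hypothetical estimator, for illustration only."""

    def __init__(self, length_scale=1.0, noise=1e-6):
        # The constructor only stores hyperparameters; no work happens here.
        self.length_scale = length_scale
        self.noise = noise

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        K = rbf_kernel(X, X, self.length_scale) + self.noise * np.eye(len(X))
        # Learned state gets a trailing underscore.
        self.X_train_ = X
        self.L_ = np.linalg.cholesky(K)
        # alpha = K^{-1} y, computed through the Cholesky factor.
        self.alpha_ = np.linalg.solve(self.L_.T, np.linalg.solve(self.L_, y))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        K_star = rbf_kernel(X, self.X_train_, self.length_scale)
        # Posterior mean of a zero-mean GP at the query points.
        return K_star @ self.alpha_
```

With this shape, SketchGPRegressor(length_scale=0.3).fit(X, y).predict(X_new)
works the same way as any other regressor in the library, which is what makes
the existing GP examples easy to port for a comparison.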

Then I see two options:

- we start a deprecation cycle, introducing new GaussianProcessRegressor
and GaussianProcessClassifier objects, to avoid too much of a mess

- we advertise this code as a scikit-learn third-party project on the wiki.
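
For the first option, a deprecation cycle could look roughly like the sketch
below. This is only an illustration under assumptions: I use GaussianProcess
as the name of the existing class, the length_scale parameter is a
placeholder, and the warning text is invented. The idea is that the old name
stays importable for a while but warns, while new code uses the new class.

```python
import warnings


class GaussianProcessRegressor:
    """Placeholder for the new estimator (name taken from the proposal)."""

    def __init__(self, length_scale=1.0):
        # Placeholder hyperparameter, not a real design decision.
        self.length_scale = length_scale


class GaussianProcess(GaussianProcessRegressor):
    """Stand-in for the existing class, kept importable during the cycle."""

    def __init__(self, *args, **kwargs):
        # Warn at construction time; stacklevel=2 points at the caller.
        warnings.warn(
            "GaussianProcess is deprecated; use GaussianProcessRegressor "
            "instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        super().__init__(*args, **kwargs)
```

Whether the old class can really be a thin subclass of the new one depends on
how different the two interfaces end up being; if they diverge a lot, the old
implementation would simply stay in place with the warning added.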

Alex


On Sun, Jan 5, 2014 at 6:00 AM, Jacob Vanderplas
<jake...@cs.washington.edu> wrote:
> Hi,
> I would tentatively be in favor of this, though I haven't yet looked closely
> at the proposed code.  I have found sklearn's gaussian process module to be
> very opaque, to the point of being unusable.  I've ended up spinning up my
> own implementation for several applications, and I know several others who
> have done the same.  Having documentation filled with terms like "Universal
> Kriging" and "Nugget" doesn't do much to help.  A sane, well-designed GP
> approach based in a machine learning framework, and documented with standard
> machine learning terminology, would be a very valuable addition to sklearn,
> IMO,
>   Jake
>
>
> On Sat, Jan 4, 2014 at 8:30 PM, Hadayat Seddiqi <had...@gmail.com> wrote:
>>
>> Hello everyone,
>>
>> I've noticed that the Gaussian Process (GP, for short) module for sklearn
>> is a bit outdated. Some months ago I tried to use it but found it unsuitable
>> for my needs, so I wrote new code based on GPML (written in MATLAB by Carl
>> Rasmussen and others) but in Python. Since then it has grown and been
>> polished, and I want to contribute the code as a module to sklearn. I'd like
>> to start a discussion about this now.
>>
>> I imagine I might be stepping on some toes coming here and proposing that
>> we delete this other code and use mine instead. I want to get everyone's
>> opinions on what I would need to do to get this suitable for sklearn, if
>> anyone is even interested in the first place. The last thing I want to be is
>> presumptuous here, but I'd be really interested in contributing my work so
>> of course I'm willing to do all the leg work.
>>
>> That was my tl;dr, but here are some reasons why I think my code might be
>> better for sklearn:
>>
>> -The GPML code (http://www.gaussianprocess.org/gpml/code/matlab/doc/) is
>> based on algorithms given in the GPML textbook, which is the de facto
>> standard text on Gaussian processes for machine learning
>> (http://www.gaussianprocess.org/gpml/). There are two reasons this is good:
>> --The textbook itself can serve as an extended documentation of the
>> theory, which is extremely useful to people who are new to GP models. I've
>> commented my code heavily to refer to this textbook with the appropriate
>> equations, algorithms, and section/chapters.
>> --It is kept up to date by its maintainers with new developments from them
>> and their students. They are an active academic group (the Cambridge ML
>> group, among others), so there's a steady feed of useful features from the
>> leaders in this field.
>>
>> -My Python version, PyGPML (https://github.com/hadsed/PyGPML), is written
>> in a slightly more sensible way than GPML in MATLAB (not the fault of the
>> programmers, I think; it's just MATLAB) that allows extension in a very
>> simple way. Each function of the model can be a built-in one, or a
>> custom-defined function that can easily be passed to the main GP object. I
>> think the code is quite clear, but in case it's not I can write some
>> documentation giving examples on how to extend it.
>>
>> -On the other hand, the sklearn.gaussian_process module hasn't been
>> updated in quite a while beyond some small corrections. I had a pull request
>> open (awaiting changes by me) to extend the module to allow multiple
>> hyperparameter training (which is a pretty significant feature that was
>> missing). I found the code hard to follow when reading it against the
>> original MATLAB DACE toolbox docs, so I gave up at that point,
>> but this might just be my own lack of sophistication in stochastic DEs. In
>> any case I thought GPML was much clearer, probably because it was given in
>> the context of modern machine learning instead of the geophysical literature
>> where GPs are known as "kriging". In addition to these two problems, one
>> cannot add a new optimization method for maximum likelihood without some
>> trouble, whereas in my code it is trivial--just change the argument to
>> train() (which calls scipy.optimize.minimize()). Unfortunately, I haven't
>> run any experiments comparing the speed of the two, though the
>> computational complexity of the GPML algorithms is quite clear. A benchmark
>> would perhaps be a good first step.
>>
>> * * *
>>
>> Anyway, these are just the more important reasons I'd like to see this
>> code be used in sklearn. I'd be really interested in everyone's opinions on
>> this idea, if this is appropriate to do for sklearn, and what next steps I
>> should take to make it suitable in the case that there is interest.
>>
>> Thanks everyone,
>>
>> Had Seddiqi
>>
>>
>> ------------------------------------------------------------------------------
>> Rapidly troubleshoot problems before they affect your business. Most IT
>> organizations don't have a clear picture of how application performance
>> affects their revenue. With AppDynamics, you get 100% visibility into your
>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics
>> Pro!
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>
