Hi,
I would tentatively be in favor of this, though I haven't yet looked
closely at the proposed code. I have found sklearn's gaussian process
module to be very opaque, to the point of being unusable. I've ended up
spinning up my own implementation for several applications, and I know
several others who have done the same. Having documentation filled with
terms like "Universal Kriging" and "Nugget" doesn't do much to help. A
sane, well-designed GP approach based in a machine learning framework, and
documented with standard machine learning terminology, would be a very
valuable addition to sklearn, IMO.
Jake
On Sat, Jan 4, 2014 at 8:30 PM, Hadayat Seddiqi <had...@gmail.com> wrote:
> Hello everyone,
>
> I've noticed that the Gaussian Process (GP, for short) module for sklearn
> is a bit outdated. Some months ago I tried to use it but found it
> unsuitable for my needs, so I wrote a new Python implementation based on
> GPML (written in MATLAB by Carl Rasmussen and others). Since then it has grown
> and been polished, and I want to contribute the code as a module to
> sklearn. I'd like to start a discussion about this now.
>
> I imagine I might be stepping on some toes coming here and proposing that
> we delete this other code and use mine instead. I want to get everyone's
> opinions on what I would need to do to make this suitable for sklearn, if
> anyone is even interested in the first place. The last thing I want to be
> is presumptuous here, but I'd be really interested in contributing my work
> so of course I'm willing to do all the leg work.
>
> That was my tl;dr, but here are some reasons why I think my code might be
> better for sklearn:
>
> -The GPML code (http://www.gaussianprocess.org/gpml/code/matlab/doc/) is
> based on algorithms given in the GPML textbook, which is the de facto
> standard text on Gaussian processes for machine learning (
> http://www.gaussianprocess.org/gpml/). There are two reasons this is good:
> --The textbook itself can serve as extended documentation of the theory,
> which is extremely useful to people who are new to GP models. I've
> commented my code heavily to refer to this textbook with the appropriate
> equations, algorithms, and sections/chapters.
> --It is kept up to date by its maintainers with new developments from
> them and their students. They belong to active academic groups (the
> Cambridge ML group, among others), so there's a constant feed of useful
> features from the leaders in this field.
>
> -My Python version, PyGPML (https://github.com/hadsed/PyGPML), is written
> in a slightly more sensible way than GPML in MATLAB (not the fault of the
> programmers, I think; it's just MATLAB) and is very simple to extend.
> Each function of the model can be either a built-in one or a
> custom-defined function that is passed to the main GP object. I think the
> code is quite clear, but in case it's not, I can write some documentation
> with examples of how to extend it.
>
> -On the other hand, the sklearn.gaussian_process module hasn't been
> updated in quite a while beyond some small corrections. I had a pull
> request open (awaiting changes by me) to extend the module to allow
> training of multiple hyperparameters (a pretty significant feature that
> was missing). I found the code hard to understand even while reading the
> original MATLAB DACE toolbox docs, so I gave up at that point, though this
> might just be my own lack of sophistication in stochastic DEs. In any case
> I thought GPML was much clearer, probably because it is presented in the
> context of modern machine learning instead of the geophysical literature,
> where GPs are known as "kriging". In addition to these two problems, one
> cannot add a new optimization method for maximum likelihood without some
> trouble, whereas in my code it is trivial: just change the argument to
> train() (which calls scipy.optimize.minimize()). Unfortunately I haven't
> run any experiments comparing the speed of the two, though the
> computational complexity of the GPML algorithms is clearly documented. A
> benchmark would perhaps be a good first step.
>
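To make the two extension points above concrete (a covariance function passed in as a plain callable, and the optimizer selected by an argument to train()), here is a minimal sketch. It is not PyGPML's or sklearn's actual API: TinyGP, rbf_kernel, and every parameter name are hypothetical, and the noise level is held fixed for brevity. The only real library call assumed is scipy.optimize.minimize() with its method argument.

```python
import numpy as np
from scipy.optimize import minimize


def rbf_kernel(log_params, A, B):
    """Squared-exponential covariance; log_params = [log length-scale, log signal std]."""
    length, signal = np.exp(log_params)
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return signal ** 2 * np.exp(-0.5 * sq_dist / length ** 2)


class TinyGP:
    """Hypothetical stand-in for a GP object (illustration only)."""

    def __init__(self, kernel, log_params, noise=0.1):
        self.kernel = kernel  # any callable (log_params, A, B) -> covariance matrix
        self.log_params = np.asarray(log_params, dtype=float)
        self.noise = noise    # fixed noise standard deviation (an assumption)

    def neg_log_likelihood(self, log_params, X, y):
        # Negative log marginal likelihood computed from a Cholesky factor.
        K = self.kernel(log_params, X, X) + self.noise ** 2 * np.eye(len(X))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return (0.5 * y @ alpha
                + np.log(np.diag(L)).sum()
                + 0.5 * len(X) * np.log(2 * np.pi))

    def train(self, X, y, method="L-BFGS-B"):
        # The optimizer is chosen by name and forwarded unchanged to scipy.
        result = minimize(self.neg_log_likelihood, self.log_params,
                          args=(X, y), method=method)
        self.log_params = result.x
        return result


# Toy data: a noisy sine curve.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 25)[:, None]
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(25)

# Swapping the optimizer is a one-argument change.
for method in ("L-BFGS-B", "Nelder-Mead"):
    gp = TinyGP(rbf_kernel, log_params=[0.0, 0.0])
    start = gp.neg_log_likelihood(gp.log_params, X, y)
    result = gp.train(X, y, method=method)
    print(method, "objective:", start, "->", result.fun)
```

Any other covariance function with the same (log_params, A, B) signature could be handed to TinyGP in place of rbf_kernel without touching the class.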
> * * *
>
> Anyway, these are just the more important reasons I'd like to see this
> code used in sklearn. I'd be really interested in everyone's opinions on
> this idea, on whether it is appropriate for sklearn, and on what next
> steps I should take to make it suitable if there is interest.
>
> Thanks everyone,
>
> Had Seddiqi
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>