Hello everyone,

I've noticed that the Gaussian process (GP) module in sklearn is a bit
outdated. Some months ago I tried to use it but found it unsuitable for my
needs, so I wrote a new implementation in Python, based on GPML (written in
MATLAB by Carl Rasmussen and others). Since then it has grown and been
polished, and I want to contribute the code as a module to sklearn. I'd
like to start a discussion about this now.

I imagine I might be stepping on some toes by coming here and proposing
that we delete the existing code and use mine instead. I'd like to hear
everyone's opinions on what I would need to do to make this suitable for
sklearn, if anyone is even interested in the first place. The last thing I
want to be is presumptuous, but I'd really like to contribute my work, so
of course I'm willing to do all the legwork.

That was my tl;dr, but here are some reasons why I think my code might be
better for sklearn:

-The GPML code (http://www.gaussianprocess.org/gpml/code/matlab/doc/) is
based on the algorithms given in the GPML textbook, which is the de facto
standard text on Gaussian processes for machine learning (
http://www.gaussianprocess.org/gpml/). There are two reasons this is good:
--The textbook itself can serve as extended documentation of the theory,
which is extremely useful to people who are new to GP models. I've
commented my code heavily with references to the relevant equations,
algorithms, sections, and chapters of the textbook.
--GPML is kept up to date by its maintainers with new developments from
them and their students. They are an active academic group (the Cambridge
ML group, along with others), so there's a steady stream of useful
features from the leaders in this field.

-My Python version, PyGPML (https://github.com/hadsed/PyGPML), is written
in a slightly more sensible way than GPML in MATLAB (not the fault of the
programmers, I think; it's just MATLAB), which makes it very simple to
extend. Each function of the model can be either a built-in one or a
custom-defined function that is passed to the main GP object. I think the
code is quite clear, but in case it's not I can write some documentation
giving examples of how to extend it.

-On the other hand, the sklearn.gaussian_process module hasn't been updated
in quite a while beyond some small corrections. I had a pull request open
(awaiting changes by me) to extend the module to support training multiple
hyperparameters (a pretty significant feature that was missing). I found it
hard to understand the code from the original MATLAB DACE toolbox docs, so
I gave up at that point, but this might just be my own lack of
sophistication in stochastic DEs. In any case I thought GPML was much
clearer, probably because it is presented in the context of modern machine
learning instead of the geophysical literature, where GPs are known as
"kriging". In addition to these two problems, one cannot add a new
optimization method for maximum likelihood without some trouble, whereas in
my code it is trivial: just change the argument to train() (which calls
scipy.optimize.minimize()). Unfortunately I haven't run any experiments to
compare speed between the two, though the computational complexity of the
GPML algorithms is at least quite clear. Perhaps a speed comparison would
be a good first step.
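
To make the two design points above concrete (model functions passed in as
plain callables, and the optimizer chosen by one argument to train()), here
is a minimal sketch. It is not PyGPML's actual API: the names SketchGP,
rbf_cov, linear_cov and the train(method=...) signature are hypothetical
and chosen only for illustration. The one thing taken from the description
above is that training goes through scipy.optimize.minimize().

```python
import numpy as np
from scipy.optimize import minimize


def rbf_cov(A, B, log_params):
    """Built-in choice: squared-exponential covariance.
    log_params = (log length scale, log signal std)."""
    length_scale, signal_std = np.exp(log_params)
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return signal_std**2 * np.exp(-0.5 * sq_dists / length_scale**2)


def linear_cov(A, B, log_params):
    """Custom choice: linear covariance plus a constant bias term.
    log_params = (log weight std, log bias std)."""
    weight_std, bias_std = np.exp(log_params)
    return weight_std**2 * (A @ B.T) + bias_std**2


class SketchGP:
    """Hypothetical GP regressor (illustration only, not the PyGPML class)."""

    def __init__(self, cov=rbf_cov, log_cov_params=(0.0, 0.0), log_noise_std=-2.0):
        self.cov = cov  # any callable cov(A, B, log_params) -> matrix
        self.theta = np.array([*log_cov_params, log_noise_std], dtype=float)

    def _nll(self, theta, X, y):
        # Negative log marginal likelihood via a Cholesky factorization.
        K = self.cov(X, X, theta[:-1])
        K = K + (np.exp(2.0 * theta[-1]) + 1e-8) * np.eye(len(X))
        try:
            L = np.linalg.cholesky(K)
        except np.linalg.LinAlgError:
            return 1e10  # penalize hyperparameters that make K singular
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
        return (0.5 * y @ alpha
                + np.log(np.diag(L)).sum()
                + 0.5 * len(X) * np.log(2.0 * np.pi))

    def train(self, X, y, method="L-BFGS-B"):
        # The optimizer name is forwarded unchanged to scipy.
        result = minimize(self._nll, self.theta, args=(X, y), method=method)
        self.theta = result.x
        return result.fun


rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(20)

gp_rbf = SketchGP()                          # built-in covariance
nll_rbf = gp_rbf.train(X, y)                 # default optimizer

gp_lin = SketchGP(cov=linear_cov)            # user-supplied covariance
nll_lin = gp_lin.train(X, y, method="Nelder-Mead")  # different optimizer
```

In this sketch, swapping either the covariance or the optimizer is a
one-argument change, which is the kind of extensibility described above.
All hyperparameters are optimized jointly in log space, so training several
of them at once needs no special handling.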

* * *

Anyway, these are just the more important reasons I'd like to see this code
used in sklearn. I'd be really interested in everyone's opinions on this
idea, whether it is appropriate for sklearn, and what next steps I should
take to make the code suitable if there is interest.

Thanks everyone,

Had Seddiqi
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
