Hi, I feel the same as Jake.
Hadayat I looked at your code and there is a fair bit of refactoring to be
done to fit with the scikit-learn API. I would encourage you to do this
refactoring and try adapting the current GP examples to see how it compares
in terms of speed, results and code readability.

Then I see 2 options:

- we start a deprecation cycle introducing new GaussianProcessRegressor and
  GaussianProcessClassifier objects to avoid too much of a mess
- we advertise this code as sklearn third party on the wiki.

Alex

On Sun, Jan 5, 2014 at 6:00 AM, Jacob Vanderplas <jake...@cs.washington.edu> wrote:
> Hi,
> I would tentatively be in favor of this, though I haven't yet looked closely
> at the proposed code. I have found sklearn's gaussian process module to be
> very opaque, to the point of being unusable. I've ended up spinning up my
> own implementation for several applications, and I know several others who
> have done the same. Having documentation filled with terms like "Universal
> Kriging" and "Nugget" doesn't do much to help. A sane, well-designed GP
> approach based in a machine learning framework, and documented with standard
> machine learning terminology, would be a very valuable addition to sklearn,
> IMO,
> Jake
>
>
> On Sat, Jan 4, 2014 at 8:30 PM, Hadayat Seddiqi <had...@gmail.com> wrote:
>>
>> Hello everyone,
>>
>> I've noticed that the Gaussian Process (GP, for short) module for sklearn
>> is a bit outdated. Some months ago I tried to use it but found it unsuitable
>> for my needs, so I wrote a new code based on GPML (written in MATLAB by Carl
>> Rasmussen and others) but in Python. Since then it has grown and been
>> polished, and I want to contribute the code as a module to sklearn. I'd like
>> to start a discussion about this now.
>>
>> I imagine I might be stepping on some toes coming here and proposing that
>> we delete this other code and use mine instead.
>> I want to get everyone's
>> opinions on what I would need to do to get this suitable for sklearn, if
>> anyone is even interested in the first place. The last thing I want to be is
>> presumptuous here, but I'd be really interested in contributing my work so
>> of course I'm willing to do all the leg work.
>>
>> That was my tl;dr, but here are some reasons why I think my code might be
>> better for sklearn:
>>
>> -The GPML code (http://www.gaussianprocess.org/gpml/code/matlab/doc/) is
>> based on algorithms given in the GPML textbook, which is the de facto
>> standard text on Gaussian processes for machine learning
>> (http://www.gaussianprocess.org/gpml/). There are two reasons this is good:
>> --The textbook itself can serve as an extended documentation of the
>> theory, which is extremely useful to people who are new to GP models. I've
>> commented my code heavily to refer to this textbook with the appropriate
>> equations, algorithms, and section/chapters.
>> --It is kept up-to-date by the owners with new developments from them and
>> their students. They are an active academic group, Cambridge ML group, as
>> well as others, so there's a constant feed of useful features from the
>> leaders in this field.
>>
>> -My Python version, PyGPML (https://github.com/hadsed/PyGPML) is written
>> in a slightly more sensible way than GPML in MATLAB (not the fault of the
>> programmers I think, it's just MATLAB..) that allows extension in a very
>> simple way. Each function of the model can be a built-in one, or a
>> custom-defined function that can easily be passed to the main GP object. I
>> think the code is quite clear, but in case it's not I can write some
>> documentation giving examples on how to extend it.
>>
>> -On the other hand, the sklearn.gaussian_process module hasn't been
>> updated in quite a while beyond some small corrections.
>> I had a pull request
>> open (awaiting changes by me) to extend the module to allow multiple
>> hyperparameter training (which is a pretty significant feature that was
>> missing). It was a little more difficult for me to understand the code
>> reading the original MATLAB DACE toolbox docs so I'd given up at this point,
>> but this might just be my own lack of sophistication in stochastic DEs. In
>> any case I thought GPML was much clearer, probably because it was given in
>> the context of modern machine learning instead of the geophysical literature
>> where GPs are known as "kriging". In addition to these two problems, one
>> cannot add a new optimization method for maximum likelihood without some
>> trouble, whereas in my code it is trivial--just change the argument to
>> train() (which calls scipy.optimize.minimize()). Unfortunately I haven't
>> done any experiments to test speed between the two, but it should be known
>> that the complexity of the GPML algorithm is quite clear. Perhaps this would
>> be a good first step.
>>
>> * * *
>>
>> Anyway, these are just the more important reasons I'd like to see this
>> code be used in sklearn. I'd be really interested in everyone's opinions on
>> this idea, if this is appropriate to do for sklearn, and what next steps I
>> should take to make it suitable in the case that there is interest.
>>
>> Thanks everyone,
>>
>> Had Seddiqi
>>
>>
>> ------------------------------------------------------------------------------
>> Rapidly troubleshoot problems before they affect your business. Most IT
>> organizations don't have a clear picture of how application performance
>> affects their revenue. With AppDynamics, you get 100% visibility into your
>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics
>> Pro!
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
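
To make "fit with the scikit-learn API" a bit more concrete, here is a rough sketch of the shape such a refactoring might take. It is not taken from PyGPML or from sklearn.gaussian_process: the class name SketchGPRegressor, the rbf_kernel helper, and the noise_std and optimizer parameters are all hypothetical, and the kernel is fixed to a squared-exponential for brevity. It only illustrates the conventions (hyperparameters set in __init__, fit() returning self, learned attributes ending in an underscore) together with the idea from the thread of passing the optimization method straight through to scipy.optimize.minimize(). A real estimator would presumably also subclass sklearn.base.BaseEstimator and RegressorMixin.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize


def rbf_kernel(A, B, length_scale, signal_std):
    """Squared-exponential covariance between the rows of A and B."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    sq_dists = np.maximum(sq_dists, 0.0)
    return signal_std ** 2 * np.exp(-0.5 * sq_dists / length_scale ** 2)


class SketchGPRegressor:
    """Hypothetical GP regressor laid out in the scikit-learn estimator style."""

    def __init__(self, noise_std=0.1, optimizer="L-BFGS-B"):
        # Only store constructor arguments here; no computation in __init__.
        self.noise_std = noise_std
        self.optimizer = optimizer  # forwarded as minimize(method=...)

    def _neg_log_marginal_likelihood(self, log_theta, X, y):
        # Optimise in log space so both hyperparameters stay positive.
        length_scale, signal_std = np.exp(log_theta)
        K = rbf_kernel(X, X, length_scale, signal_std)
        K[np.diag_indices_from(K)] += self.noise_std ** 2
        L, lower = cho_factor(K, lower=True)
        alpha = cho_solve((L, lower), y)
        # -log p(y|X) = 0.5 y^T K^-1 y + sum(log L_ii) + (n/2) log(2 pi)
        return (
            0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))
            + 0.5 * len(y) * np.log(2.0 * np.pi)
        )

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        result = minimize(
            self._neg_log_marginal_likelihood,
            x0=np.zeros(2),  # start at length_scale = signal_std = 1
            args=(X, y),
            method=self.optimizer,
        )
        # Learned quantities get a trailing underscore, as in scikit-learn.
        self.length_scale_, self.signal_std_ = np.exp(result.x)
        K = rbf_kernel(X, X, self.length_scale_, self.signal_std_)
        K[np.diag_indices_from(K)] += self.noise_std ** 2
        self.X_train_ = X
        self.alpha_ = cho_solve(cho_factor(K, lower=True), y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        K_star = rbf_kernel(X, self.X_train_, self.length_scale_, self.signal_std_)
        return K_star @ self.alpha_  # posterior mean at the query points


X = np.linspace(0.0, 5.0, 20)[:, None]
y = np.sin(X).ravel()
model = SketchGPRegressor().fit(X, y)
pred = model.predict(X)
```

With this layout, trying a different maximum-likelihood optimizer is a constructor argument, for example SketchGPRegressor(optimizer="Nelder-Mead"), which is the kind of flexibility described for train() above.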