On Tue, Feb 19, 2013 at 7:55 PM, Lars Buitinck <l.j.buiti...@uva.nl> wrote:
> 2013/2/19 James Bergstra <james.bergs...@gmail.com>:
>> Further to this: I started a project on github to look at how to
>> combine hyperopt with sklearn.
>> https://github.com/jaberg/hyperopt-sklearn
>>
>> I've only wrapped one algorithm so far: Perceptron
>> https://github.com/jaberg/hyperopt-sklearn/blob/master/hpsklearn/perceptron.py
>>
>> My idea is that little files like perceptron.py would encode
>> (a) domain expertise about what values make sense for a particular
>> hyper-parameter (see the `search_space()` function) and
>> (b) a sklearn-style fit/predict interface that encapsulates search
>> over those hyper-parameters (see `AutoPerceptron`)
>
> I'm not sure what your long-term goals with this project are, but I
> see three problems with this approach:
> 1. The values might be problem-dependent rather than
> estimator-dependent. In your example, you're optimizing for accuracy,
> but you might want to optimize for F1-score instead.
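(Before replying point by point: for anyone on the list who hasn't opened
perceptron.py, search_space() encodes roughly the kind of thing sketched
below. This is illustrative only, written from memory against hyperopt's
hp module; the parameter names follow sklearn's Perceptron, but the ranges
are rough guesses rather than the committed code.)

    from hyperopt import hp

    def search_space():
        # Illustrative only: rough prior knowledge about ranges that tend
        # to be reasonable for sklearn's Perceptron, not the actual
        # contents of perceptron.py.  hp.loguniform bounds are in
        # log-space: exp(-9) ~ 1e-4, exp(-1) ~ 0.4, exp(-3) ~ 0.05.
        return {
            'penalty': hp.choice('penalty', [None, 'l2', 'l1', 'elasticnet']),
            'alpha': hp.loguniform('alpha', -9, -1),
            'eta0': hp.loguniform('eta0', -3, 1),
            'fit_intercept': hp.choice('fit_intercept', [True, False]),
        }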
Good point, and if I understand correctly, it's related to your other point below about GridSearch. I think you are pointing out that the design of AutoPerceptron is off the mark for two reasons:

1. There is only one line in that class that actually refers to Perceptron, so why not make the actual estimator a constructor argument? (I agree, it should be an argument.)

2. The class mainly consists of plumbing, but it is also hard-coded to compute classification error. This is silly; it would be better to use either (a) the native loss of the estimator or (b) some specific user-supplied validation metric.

I agree with both of these points. Let me know if I misunderstood you, though.

> 2. The number of estimators is *huge* if you also consider
> combinations like SelectKBest(chi2) -> RBFSamples -> SGDClassifier
> pipelines (a classifier that I was trying out only yesterday).

Yes, the number of estimators in a search space can be huge. In my research on visual system models I found that hyperopt was surprisingly useful, even in the face of daunting configuration problems. The point of this project, for me, is to see how it stacks up.

One design aspect that doesn't come through in the current code sample is that the hard-coded parameter spaces (which I'll come to in a second) must compose. What I mean is that if someone has written up a standard SGDClassifier search space, and someone has coded up search spaces for SelectKBest and RBFSamples, then you should be able to string those all together and search the joint space without much trouble. Your particular case is exactly the sort of case I hope eventually to address: it's difficult to give sensible defaults to each of those modules before knowing either (a) what kind of data they will process or (b) what's going on in the rest of the pipeline. Tuning a bunch of interacting variables whose effects are only measured by long-running programs is hard for people; automatic methods don't actually have to be all that efficient to be competitive.

> 3. The estimator parameters change sometimes, so this would have to be
> kept in sync with scikit-learn.

This is a price I was expecting to have to pay; I don't see any way around it. Part of the value of this library is encoding parameter ranges for specific estimators. That tight coupling is not something to be dodged.

- James

> When I wrote the scikit-learn wrapper for NLTK [1], I chose a strategy
> where *no scikit-learn code is imported at all* (except when the user
> runs the demo or unit tests). Instead, the user is responsible for
> importing it and constructing the appropriate estimator. This makes
> the code robust to API changes, and it can handle arbitrarily complex
> sklearn.Pipeline objects, as well as estimators that follow the API
> conventions but are not in scikit-learn proper.
>
> I think a similar approach can be followed here. While some
> suggestions for parameters to try might be shipped as examples, an
> estimator- and evaluation-agnostic wrapper class ("meta-estimator") is
> a stronger basis for a package like the one you're writing.
> scikit-learn's own GridSearch is also implemented like this, to a
> large extent.
>
> [1]
> https://github.com/nltk/nltk/blob/f7f3b73f0f051639d87cfeea43b0aabf6f167b8f/nltk/classify/scikitlearn.py

Thanks, yes, there is a strong similarity between what I'm trying to do and GridSearch, so it makes sense to use similar strategies for comparing model outputs. The "AutoPerceptron" class would be improved by being more generic, like GridSearch.
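To make that concrete, here is a rough sketch of the kind of estimator- and evaluation-agnostic meta-estimator I have in mind. It is illustrative only: the class name, constructor arguments, and helpers are made up for this email, it holds out a single validation split rather than doing proper cross-validation, and it assumes dense numpy inputs.

    import numpy as np
    from hyperopt import fmin, tpe

    class HyperoptEstimator(object):
        """Illustrative meta-estimator: a hyperopt search wrapped in a
        fit/predict interface.  No sklearn imports in here; the user
        supplies a factory that builds the (possibly Pipeline) estimator."""

        def __init__(self, make_estimator, space, scoring,
                     max_evals=50, valid_fraction=0.2, seed=0):
            # make_estimator(params) -> unfitted estimator (or Pipeline)
            # space: a hyperopt search space, e.g. a dict of hp expressions
            # scoring(estimator, X_valid, y_valid) -> score to *maximize*
            self.make_estimator = make_estimator
            self.space = space
            self.scoring = scoring
            self.max_evals = max_evals
            self.valid_fraction = valid_fraction
            self.seed = seed

        def fit(self, X, y):
            # Assumes X and y are dense numpy arrays (indexable by index arrays).
            X, y = np.asarray(X), np.asarray(y)
            rng = np.random.RandomState(self.seed)
            perm = rng.permutation(len(X))
            n_valid = int(self.valid_fraction * len(X))
            valid, train = perm[:n_valid], perm[n_valid:]

            best = {'score': -np.inf, 'params': None}

            def objective(params):
                est = self.make_estimator(params)
                est.fit(X[train], y[train])
                score = self.scoring(est, X[valid], y[valid])
                if score > best['score']:
                    best['score'], best['params'] = score, params
                return -score  # hyperopt minimizes, so negate the score

            fmin(objective, self.space, algo=tpe.suggest,
                 max_evals=self.max_evals)

            # Refit the best configuration found on all of the data.
            self.best_params_ = best['params']
            self.best_estimator_ = self.make_estimator(self.best_params_)
            self.best_estimator_.fit(X, y)
            return self

        def predict(self, X):
            return self.best_estimator_.predict(X)

Usage would look something like the following (hypothetical names again; X_train, y_train, X_test, y_test are whatever the user has). The wrapper itself never imports scikit-learn: the user-supplied make_estimator is free to build the SelectKBest(chi2) -> RBFSamples -> SGDClassifier pipeline you mentioned, with the sub-spaces for each step merged into one search space, and the scoring argument can be F1 instead of accuracy, which addresses your point 1.

    # Hypothetical user code; the sklearn imports live here, not in the wrapper.
    from sklearn.linear_model import Perceptron
    from sklearn.metrics import f1_score

    def make_perceptron(params):
        return Perceptron(**params)

    def f1(est, X_valid, y_valid):
        return f1_score(y_valid, est.predict(X_valid))

    # search_space() is the Perceptron sketch from earlier in this message.
    model = HyperoptEstimator(make_perceptron, search_space(), f1, max_evals=100)
    model.fit(X_train, y_train)
    print(f1(model, X_test, y_test))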
- James