Hah, and I just wanted to write about the VotingClassifier. I remember my struggle quite well when I tried to make it Pipeline and GridSearch compatible, until I figured that one out :P

> On Mar 23, 2016, at 12:34 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>
> And I lied that none of the scikit-learn estimators define their own
> get_params. Of course the following do: VotingClassifier, Kernel (and
> subclasses), Pipeline and FeatureUnion
>
> On 23 March 2016 at 15:04, Joel Nothman <joel.noth...@gmail.com> wrote:
> something like the following may suffice:
>
> def get_params(self, deep=True):
>     out = super(WordCooccurrenceVectorizer, self).get_params(deep=deep)
>     out['w2v_clusters'] = self.w2v_clusters
>     return out
>
> On 23 March 2016 at 15:01, Joel Nothman <joel.noth...@gmail.com> wrote:
> Hi Fred,
>
> We use the __init__ signature to get the list of parameters that (a) can be
> set by grid search; (b) need to be copied to a cloned instance of the
> estimator (with any fitted model discarded) in constructing ensembles, cross
> validation, etc. While none of the scikit-learn library of estimators do
> this, in practice you can overload get_params to define your own parameter
> listing. See
> http://scikit-learn.org/stable/developers/contributing.html#get-params-and-set-params
>
> On 23 March 2016 at 14:45, Fred Mailhot <fred.mail...@gmail.com> wrote:
> Hello list,
>
> Firstly, thanks for this incredible package; I use it daily at work. Now on
> to the meat: I'm trying to subclass TfidfVectorizer and running into issues.
> I want to specify an extra param for __init__() that points to a file that
> gets used in build_analyzer().
> Skipping irrelevant bits, I've got the following:
>
> #======================
> class WordCooccurrenceVectorizer(TfidfVectorizer):
>
>     ### override __init__ to add w2v_clusters arg
>     # see http://stackoverflow.com/questions/2215923/avoid-specifying-all-arguments-in-a-subclass
>     # for explanation of syntax
>     def __init__(self, *args, **kwargs):
>         try:
>             self.w2v_cluster_path = kwargs.pop("w2v_clusters")
>         except KeyError:
>             pass
>         super(WordCooccurrenceVectorizer, self).__init__(*args, **kwargs)
>
>     def build_analyzer(self):
>         preprocess = self.build_preprocessor()
>         stopwords = self.get_stop_words()
>         w2v_clusters = self.load_w2v_clusters()
>         tokenize = self.build_tokenizer()
>         return lambda doc: self._nwise(tokenize(preprocess(self.decode(doc))), stopwords, w2v_clusters)
>
>     [...]
> #======================
>
> I can instantiate this, but when I want to inspect it, I get the following
> (this is in ipython, in a script it just hangs):
>
> #======================
> In [2]: vec = WordCooccurrenceVectorizer(ngram_range=(2,2),
>             stop_words="english", max_df=0.5, min_df=1, max_features=10000,
>             w2v_clusters="clusters.20160322_1803.w2v", binary=True)
>
> In [3]: vec
> Out[3]:
> ---------------------------------------------------------------------------
> RuntimeError                              Traceback (most recent call last)
> /Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/IPython/core/formatters.pyc
> in __call__(self, obj)
>     697                 type_pprinters=self.type_printers,
>     698                 deferred_pprinters=self.deferred_printers)
> --> 699             printer.pretty(obj)
>     700             printer.flush()
>     701             return stream.getvalue()
>
> [...]
>
> /Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/sklearn/base.pyc
> in _get_param_names(cls)
>     193                                    " %s with constructor %s doesn't "
>     194                                    " follow this convention."
> --> 195                                    % (cls, init_signature))
>     196         # Extract and sort argument names excluding 'self'
>     197         return sorted([p.name for p in parameters])
>
> RuntimeError: scikit-learn estimators should always specify their parameters
> in the signature of their __init__ (no varargs). <class
> 'cooc_vectorizer.WordCooccurrenceVectorizer'> with constructor (<self>,
> *args, **kwargs) doesn't follow this convention.
>
> In [4]:
> #======================
>
> The error is clear enough -- I can't use *args and **kwargs in a sklearn
> estimator's __init__() -- but I'm not sure what the correct way is to do what
> I need to do. Do I literally need to specify all of the __init__ params in my
> subclass and then pass them on to the __init__ of super()? If so, what's the
> reason for setting this up this way?
>
> Thanks for any pointers/guidance,
> Fred.

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
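
For readers following the thread, here is a minimal sketch of the convention Joel describes: parameter names are read off the __init__ signature, which is why *args hides them and why an explicit signature makes get_params and cloning work. This is not scikit-learn's actual code. The names param_names, clone, BaseVectorizer, VarargsVectorizer and ExplicitVectorizer are hypothetical stand-ins, written for Python 3 with only the standard library.

```python
import inspect


def param_names(cls):
    """List the explicit __init__ parameter names of cls, excluding 'self'.

    Simplified illustration of signature introspection (an assumption,
    not the library's implementation): *args makes the listing impossible.
    """
    signature = inspect.signature(cls.__init__)
    params = [p for name, p in signature.parameters.items() if name != "self"]
    for p in params:
        if p.kind == p.VAR_POSITIONAL:
            raise RuntimeError(
                "%s uses *args in __init__, so its parameters cannot be listed"
                % cls.__name__
            )
    # **kwargs carries no parameter names, so it contributes nothing here.
    return sorted(p.name for p in params if p.kind != p.VAR_KEYWORD)


class BaseVectorizer:
    """Hypothetical stand-in for a library estimator with explicit parameters."""

    def __init__(self, lowercase=True, max_df=1.0):
        self.lowercase = lowercase
        self.max_df = max_df

    def get_params(self):
        # Each parameter is stored under an attribute of the same name.
        return {name: getattr(self, name) for name in param_names(type(self))}


class VarargsVectorizer(BaseVectorizer):
    """Mirrors the pattern from the question: parameters hidden behind varargs."""

    def __init__(self, *args, **kwargs):
        self.w2v_clusters = kwargs.pop("w2v_clusters", None)
        super().__init__(*args, **kwargs)


class ExplicitVectorizer(BaseVectorizer):
    """Repeats the parent's parameters and adds one of its own."""

    def __init__(self, lowercase=True, max_df=1.0, w2v_clusters=None):
        super().__init__(lowercase=lowercase, max_df=max_df)
        self.w2v_clusters = w2v_clusters


def clone(estimator):
    """Build a fresh, unfitted copy from the listed parameters alone."""
    return type(estimator)(**estimator.get_params())
```

With the explicit signature, the extra w2v_clusters argument shows up in get_params() and survives cloning, while calling get_params() on the varargs version raises a RuntimeError. The cost is repeating the parent's parameters in the subclass; overriding get_params, as suggested earlier in the thread, is the alternative.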