And I was wrong to say that none of the scikit-learn estimators define their
own get_params. Of course the following do: VotingClassifier, Kernel (and
subclasses), Pipeline and FeatureUnion.
On 23 March 2016 at 15:04, Joel Nothman <joel.noth...@gmail.com> wrote:
> something like the following may suffice:
>
> def get_params(self, deep=True):
>     out = super(WordCooccurrenceVectorizer, self).get_params(deep=deep)
>     out['w2v_clusters'] = self.w2v_clusters
>     return out
>
> On 23 March 2016 at 15:01, Joel Nothman <joel.noth...@gmail.com> wrote:
>
>> Hi Fred,
>>
>> We use the __init__ signature to get the list of parameters that (a) can
>> be set by grid search; and (b) need to be copied to a cloned instance of
>> the estimator (with any fitted model discarded) when constructing
>> ensembles, running cross validation, etc. While none of the estimators in
>> the scikit-learn library do this, in practice you can override get_params
>> to define your own parameter listing. See
>> http://scikit-learn.org/stable/developers/contributing.html#get-params-and-set-params
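
To make the role of the __init__ signature concrete, here is a minimal
sketch of the idea in plain Python 3. ParamsFromInit, Explicit and Varargs
are toy classes written for illustration and are not scikit-learn's
BaseEstimator; the method names echo the ones in the traceback, but the
bodies are assumptions about the general approach, not the library's code.

```python
import inspect


class ParamsFromInit:
    """Toy stand-in for a scikit-learn-style base class (illustration only)."""

    @classmethod
    def _get_param_names(cls):
        # Read the parameter names straight from the __init__ signature.
        signature = inspect.signature(cls.__init__)
        params = [p for name, p in signature.parameters.items()
                  if name != "self"]
        for p in params:
            if p.kind == p.VAR_POSITIONAL:
                # With *args there are no names to list, so give up loudly.
                raise RuntimeError(
                    "%s uses *args; its parameters cannot be listed"
                    % cls.__name__)
        return sorted(p.name for p in params if p.kind != p.VAR_KEYWORD)

    def get_params(self):
        # Each listed name is expected to exist as an attribute on self.
        return {name: getattr(self, name)
                for name in self._get_param_names()}

    def clone(self):
        # Rebuild an unfitted copy from the constructor parameters alone.
        return type(self)(**self.get_params())


class Explicit(ParamsFromInit):
    def __init__(self, max_df=1.0, w2v_clusters=None):
        self.max_df = max_df
        self.w2v_clusters = w2v_clusters


class Varargs(ParamsFromInit):
    def __init__(self, *args, **kwargs):
        self.w2v_clusters = kwargs.pop("w2v_clusters", None)


est = Explicit(max_df=0.5, w2v_clusters="clusters.w2v")
print(est.get_params())
print(est.clone().get_params() == est.get_params())

try:
    Varargs(w2v_clusters="clusters.w2v").get_params()
except RuntimeError as exc:
    print("RuntimeError:", exc)
```

The Explicit class can be listed and cloned because every parameter is
named in its signature; the Varargs class cannot, which is the situation
the RuntimeError in the original question reports.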
>>
>> On 23 March 2016 at 14:45, Fred Mailhot <fred.mail...@gmail.com> wrote:
>>
>>> Hello list,
>>>
>>> Firstly, thanks for this incredible package; I use it daily at work. Now
>>> on to the meat: I'm trying to subclass TfidfVectorizer and running into
>>> issues. I want to specify an extra param for __init__() that points to a
>>> file that gets used in build_analyzer(). Skipping irrelevant bits, I've got
>>> the following:
>>>
>>> #======================
>>> class WordCooccurrenceVectorizer(TfidfVectorizer):
>>>
>>>     ### override __init__ to add w2v_clusters arg
>>>     # see http://stackoverflow.com/questions/2215923/avoid-specifying-all-arguments-in-a-subclass
>>>     # for explanation of syntax
>>>     def __init__(self, *args, **kwargs):
>>>         try:
>>>             self.w2v_cluster_path = kwargs.pop("w2v_clusters")
>>>         except KeyError:
>>>             pass
>>>         super(WordCooccurrenceVectorizer, self).__init__(*args, **kwargs)
>>>
>>>     def build_analyzer(self):
>>>         preprocess = self.build_preprocessor()
>>>         stopwords = self.get_stop_words()
>>>         w2v_clusters = self.load_w2v_clusters()
>>>         tokenize = self.build_tokenizer()
>>>         return lambda doc: self._nwise(
>>>             tokenize(preprocess(self.decode(doc))), stopwords, w2v_clusters)
>>>
>>>     [...]
>>> #======================
>>>
>>> I can instantiate this, but when I want to inspect it, I get the
>>> following (this is in ipython, in a script it just hangs):
>>>
>>> #======================
>>> In [2]: vec = WordCooccurrenceVectorizer(ngram_range=(2,2),
>>> stop_words="english", max_df=0.5, min_df=1, max_features=10000,
>>> w2v_clusters="clusters.20160322_1803.w2v", binary=True)
>>>
>>> In [3]: vec
>>> Out[3]:
>>> ---------------------------------------------------------------------------
>>> RuntimeError                              Traceback (most recent call last)
>>> /Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
>>> 697 type_pprinters=self.type_printers,
>>> 698 deferred_pprinters=self.deferred_printers)
>>> --> 699 printer.pretty(obj)
>>> 700 printer.flush()
>>> 701 return stream.getvalue()
>>>
>>> [...]
>>>
>>> /Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/sklearn/base.pyc in _get_param_names(cls)
>>>     193                                    " %s with constructor %s doesn't "
>>>     194                                    " follow this convention."
>>> --> 195                                    % (cls, init_signature))
>>>     196         # Extract and sort argument names excluding 'self'
>>>     197         return sorted([p.name for p in parameters])
>>>
>>> RuntimeError: scikit-learn estimators should always specify their
>>> parameters in the signature of their __init__ (no varargs). <class
>>> 'cooc_vectorizer.WordCooccurrenceVectorizer'> with constructor (<self>,
>>> *args, **kwargs) doesn't follow this convention.
>>>
>>> In [4]:
>>> #======================
>>>
>>> The error is clear enough -- I can't use *args and **kwargs in a sklearn
>>> estimator's __init__() -- but I'm not sure what the correct way is to do
>>> what I need to do. Do I literally need to specify all of the __init__
>>> params in my subclass and then pass them on to the __init__ of super()? If
>>> so, what's the reason for setting this up this way?
>>>
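
On the question of spelling out every parameter: one way to follow the
convention is sketched below in Python 3. ToyVectorizer is a hypothetical
stand-in for TfidfVectorizer with only three of its parameters, so that the
block runs without scikit-learn, and init_param_names is a helper invented
for this example. With the real class, the same pattern would mean repeating
every constructor parameter of whichever scikit-learn version is installed.

```python
import inspect


class ToyVectorizer:
    """Hypothetical parent with a small, explicit constructor."""

    def __init__(self, ngram_range=(1, 1), stop_words=None, binary=False):
        self.ngram_range = ngram_range
        self.stop_words = stop_words
        self.binary = binary


class WordCooccurrenceVectorizer(ToyVectorizer):
    def __init__(self, ngram_range=(1, 1), stop_words=None, binary=False,
                 w2v_clusters=None):
        # Repeat the parent's parameters and forward them by keyword.
        super().__init__(ngram_range=ngram_range, stop_words=stop_words,
                         binary=binary)
        # Store the new argument unmodified, under the same name as the
        # parameter, so signature-based introspection can find it again.
        # Loading the file is left to a later step such as build_analyzer.
        self.w2v_clusters = w2v_clusters


def init_param_names(cls):
    """List constructor parameter names, excluding self (example helper)."""
    return sorted(name for name in inspect.signature(cls.__init__).parameters
                  if name != "self")


vec = WordCooccurrenceVectorizer(ngram_range=(2, 2), stop_words="english",
                                 binary=True,
                                 w2v_clusters="clusters.20160322_1803.w2v")
print(init_param_names(WordCooccurrenceVectorizer))
print({name: getattr(vec, name)
       for name in init_param_names(WordCooccurrenceVectorizer)})
```

Note that the attribute is called w2v_clusters, matching the keyword, rather
than w2v_cluster_path as in the snippet above; a name mismatch would leave
signature-based lookups with nothing to read.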
>>>
>>> Thanks for any pointers/guidance,
>>> Fred.
>>>
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Transform Data into Opportunity.
>>> Accelerate data analysis in your applications with
>>> Intel Data Analytics Acceleration Library.
>>> Click to learn more.
>>> http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>
>