Re: [Scikit-learn-general] Subclassing vectorizers

Fred Mailhot Wed, 23 Mar 2016 08:26:01 -0700

Thanks very much everyone; seems to be working now!


On 23 March 2016 at 00:58, Sebastian Raschka <[email protected]> wrote:

> Hah, and I just wanted to write regarding the VotingClassifier; I
> remember my struggle quite well when I tried to make it pipeline and
> GridSearch compatible until I figured that one out :P
>
> > On Mar 23, 2016, at 12:34 AM, Joel Nothman <[email protected]> wrote:
> >
> > And I lied that none of the scikit-learn estimators define their own
> > get_params. Of course the following do: VotingClassifier, Kernel (and
> > subclasses), Pipeline and FeatureUnion.
> >
> > On 23 March 2016 at 15:04, Joel Nothman <[email protected]> wrote:
> > something like the following may suffice:
> >
> > def get_params(self, deep=True):
> >     out = super(WordCooccurrenceVectorizer, self).get_params(deep=deep)
> >     out['w2v_clusters'] = self.w2v_clusters
> >     return out
> >
> > On 23 March 2016 at 15:01, Joel Nothman <[email protected]> wrote:
> > Hi Fred,
> >
> > We use the __init__ signature to get the list of parameters that (a) can
> > be set by grid search; (b) need to be copied to a cloned instance of the
> > estimator (with any fitted model discarded) in constructing ensembles,
> > cross validation, etc. While none of the scikit-learn library of estimators
> > do this, in practice you can override get_params to define your own
> > parameter listing. See
> > http://scikit-learn.org/stable/developers/contributing.html#get-params-and-set-params
> >
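To make the point about the __init__ signature concrete, here is a minimal sketch of the idea in plain Python. It is not scikit-learn's actual code: param_names, clone, Explicit and Opaque are hypothetical names made up for illustration, and the real library handles more cases (nested parameters, for instance).

```python
import inspect


def param_names(cls):
    """Return constructor parameter names, refusing *args (simplified sketch)."""
    names = []
    for p in inspect.signature(cls.__init__).parameters.values():
        if p.name == "self":
            continue
        if p.kind is inspect.Parameter.VAR_POSITIONAL:
            # *args hides the parameter names, so nothing can be listed.
            raise RuntimeError("%s must list its parameters explicitly"
                               % cls.__name__)
        if p.kind is inspect.Parameter.VAR_KEYWORD:
            continue  # **kwargs contributes no discoverable names
        names.append(p.name)
    return sorted(names)


def clone(est):
    """Build a fresh copy from constructor parameters only (hypothetical helper)."""
    params = {name: getattr(est, name) for name in param_names(type(est))}
    return type(est)(**params)


class Explicit(object):
    # Every parameter is named and stored under the same attribute name.
    def __init__(self, max_df=1.0, w2v_clusters=None):
        self.max_df = max_df
        self.w2v_clusters = w2v_clusters


class Opaque(object):
    # Same shape as the constructor in the original question.
    def __init__(self, *args, **kwargs):
        self.w2v_clusters = kwargs.pop("w2v_clusters", None)
```

With Explicit, the parameter list is discoverable and a copy can be rebuilt from it without carrying over anything learned during fitting. With Opaque there is nothing to discover, which is roughly the situation the RuntimeError in the traceback further down complains about.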
> > On 23 March 2016 at 14:45, Fred Mailhot <[email protected]> wrote:
> > Hello list,
> >
> > Firstly, thanks for this incredible package; I use it daily at work. Now
> > on to the meat: I'm trying to subclass TfidfVectorizer and running into
> > issues. I want to specify an extra param for __init__() that points to a
> > file that gets used in build_analyzer(). Skipping irrelevant bits, I've got
> > the following:
> >
> > #======================
> > class WordCooccurrenceVectorizer(TfidfVectorizer):
> >
> >     ### override __init__ to add w2v_clusters arg
> >     # see http://stackoverflow.com/questions/2215923/avoid-specifying-all-arguments-in-a-subclass
> >     # for explanation of syntax
> >     def __init__(self, *args, **kwargs):
> >         try:
> >             self.w2v_cluster_path = kwargs.pop("w2v_clusters")
> >         except KeyError:
> >             pass
> >         super(WordCooccurrenceVectorizer, self).__init__(*args, **kwargs)
> >
> >     def build_analyzer(self):
> >         preprocess = self.build_preprocessor()
> >         stopwords = self.get_stop_words()
> >         w2v_clusters = self.load_w2v_clusters()
> >         tokenize = self.build_tokenizer()
> >         return lambda doc: self._nwise(
> >             tokenize(preprocess(self.decode(doc))), stopwords, w2v_clusters)
> >     [...]
> > #======================
> >
> > I can instantiate this, but when I want to inspect it, I get the
> > following (this is in ipython, in a script it just hangs):
> >
> > #======================
> > In [2]: vec = WordCooccurrenceVectorizer(ngram_range=(2,2),
> >     stop_words="english", max_df=0.5, min_df=1, max_features=10000,
> >     w2v_clusters="clusters.20160322_1803.w2v", binary=True)
> >
> > In [3]: vec
> > Out[3]:
> > ---------------------------------------------------------------------------
> > RuntimeError                              Traceback (most recent call last)
> >
> > /Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
> >     697                 type_pprinters=self.type_printers,
> >     698                 deferred_pprinters=self.deferred_printers)
> > --> 699             printer.pretty(obj)
> >     700             printer.flush()
> >     701             return stream.getvalue()
> >
> > [...]
> >
> >
> > /Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/sklearn/base.pyc in _get_param_names(cls)
> >     193                                    " %s with constructor %s doesn't "
> >     194                                    " follow this convention."
> > --> 195                                    % (cls, init_signature))
> >     196         # Extract and sort argument names excluding 'self'
> >     197         return sorted([p.name for p in parameters])
> >
> > RuntimeError: scikit-learn estimators should always specify their
> > parameters in the signature of their __init__ (no varargs). <class
> > 'cooc_vectorizer.WordCooccurrenceVectorizer'> with constructor (<self>,
> > *args, **kwargs) doesn't  follow this convention.
> >
> > In [4]:
> > #======================
> >
> > The error is clear enough -- I can't use *args and **kwargs in a sklearn
> > estimator's __init__() -- but I'm not sure what the correct way is to do
> > what I need to do. Do I literally need to specify all of the __init__
> > params in my subclass and then pass them on to the __init__ of super()? If
> > so, what's the reason for setting this up this way?
> >
> >
> > Thanks for any pointers/guidance,
> > Fred.
> >
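For what it's worth, the conventional answer to the question above appears to be yes: list every parameter by name in the subclass, pass the inherited ones on, and store the new one unchanged under an attribute with the same name as the argument. Below is a minimal sketch of that pattern. It assumes a hypothetical stand-in base class, TinyVectorizer, with only three parameters so that it runs without scikit-learn; a real TfidfVectorizer subclass would have to repeat that class's full parameter list for the installed version.

```python
import inspect


class TinyVectorizer(object):
    """Hypothetical stand-in for a vectorizer base class (not sklearn code)."""

    def __init__(self, ngram_range=(1, 1), max_df=1.0, binary=False):
        self.ngram_range = ngram_range
        self.max_df = max_df
        self.binary = binary

    def get_params(self, deep=True):
        # Read the names from the constructor of the actual (sub)class,
        # then look each one up as an attribute of the same name.
        sig = inspect.signature(type(self).__init__)
        return {n: getattr(self, n) for n in sig.parameters if n != "self"}


class CooccurrenceVectorizer(TinyVectorizer):
    # No *args/**kwargs: inherited parameters are repeated explicitly.
    def __init__(self, ngram_range=(1, 1), max_df=1.0, binary=False,
                 w2v_clusters=None):
        super().__init__(ngram_range=ngram_range, max_df=max_df,
                         binary=binary)
        # Always set, and named exactly like the argument.
        self.w2v_clusters = w2v_clusters


vec = CooccurrenceVectorizer(ngram_range=(2, 2), w2v_clusters="clusters.w2v")
params = vec.get_params()
rebuilt = CooccurrenceVectorizer(**params)
```

Two details differ from the class in the question: the attribute is called w2v_clusters rather than w2v_cluster_path, and it is assigned even when the argument is omitted, so a lookup by parameter name cannot fail.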
> >
> >
> ------------------------------------------------------------------------------
> > Transform Data into Opportunity.
> > Accelerate data analysis in your applications with
> > Intel Data Analytics Acceleration Library.
> > Click to learn more.
> > http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140
> > _______________________________________________
> > Scikit-learn-general mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >
>
