Hello list,

Firstly, thanks for this incredible package; I use it daily at work. Now on
to the meat: I'm trying to subclass TfidfVectorizer and running into
issues. I want to specify an extra param for __init__() that points to a
file that gets used in build_analyzer(). Skipping irrelevant bits, I've got
the following:

#======================
class WordCooccurrenceVectorizer(TfidfVectorizer):

    ### override __init__ to add w2v_clusters arg
    # see http://stackoverflow.com/questions/2215923/avoid-specifying-all-arguments-in-a-subclass
    # for explanation of syntax
    def __init__(self, *args, **kwargs):
        # default to None so the attribute always exists, even when
        # w2v_clusters is not passed
        self.w2v_cluster_path = kwargs.pop("w2v_clusters", None)
        super(WordCooccurrenceVectorizer, self).__init__(*args, **kwargs)

    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        stopwords = self.get_stop_words()
        w2v_clusters = self.load_w2v_clusters()
        tokenize = self.build_tokenizer()
        return lambda doc: self._nwise(
            tokenize(preprocess(self.decode(doc))), stopwords, w2v_clusters)
    [...]
#======================

I can instantiate this, but when I try to inspect it I get the following
(this is in IPython; in a script it just hangs):

#======================
In [2]: vec = WordCooccurrenceVectorizer(ngram_range=(2,2),
stop_words="english", max_df=0.5, min_df=1, max_features=10000,
w2v_clusters="clusters.20160322_1803.w2v", binary=True)

In [3]: vec
Out[3]:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/IPython/core/formatters.pyc in __call__(self, obj)
    697                 type_pprinters=self.type_printers,
    698                 deferred_pprinters=self.deferred_printers)
--> 699             printer.pretty(obj)
    700             printer.flush()
    701             return stream.getvalue()

[...]

/Users/fredmailhot/anaconda/envs/csai_experiments/lib/python2.7/site-packages/sklearn/base.pyc in _get_param_names(cls)
    193                                    " %s with constructor %s doesn't "
    194                                    " follow this convention."
--> 195                                    % (cls, init_signature))
    196         # Extract and sort argument names excluding 'self'
    197         return sorted([p.name for p in parameters])

RuntimeError: scikit-learn estimators should always specify their
parameters in the signature of their __init__ (no varargs). <class
'cooc_vectorizer.WordCooccurrenceVectorizer'> with constructor (<self>,
*args, **kwargs) doesn't  follow this convention.

In [4]:
#======================

The error is clear enough -- I can't use *args and **kwargs in a sklearn
estimator's __init__() -- but I'm not sure what the correct way is to do
what I need to do. Do I literally need to specify all of the __init__
params in my subclass and then pass them on to the __init__ of super()? If
so, what's the reason for setting this up this way?
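
To make the question concrete, here is a minimal stdlib-only sketch of what I
understand the check in the traceback to be doing, and of the explicit-signature
alternative I'm asking about. It is an imitation for illustration, not
sklearn's actual code: get_param_names, BaseVectorizer, VarargsVectorizer and
ExplicitVectorizer are all hypothetical names, BaseVectorizer stands in for
TfidfVectorizer with only two of its parameters, and it assumes Python 3
(inspect.signature, bare super()) rather than the 2.7 shown above.

```python
import inspect


def get_param_names(cls):
    """Simplified imitation of the signature check seen in the traceback."""
    signature = inspect.signature(cls.__init__)
    # Ignore 'self' and **kwargs; every remaining entry counts as a parameter.
    parameters = [
        p for p in signature.parameters.values()
        if p.name != "self" and p.kind != p.VAR_KEYWORD
    ]
    for p in parameters:
        if p.kind == p.VAR_POSITIONAL:
            # *args hides the parameter names, so they cannot be listed.
            raise RuntimeError(
                "%s with constructor %s doesn't follow this convention."
                % (cls, signature))
    return sorted(p.name for p in parameters)


class BaseVectorizer:
    """Hypothetical stand-in for TfidfVectorizer (two parameters only)."""

    def __init__(self, ngram_range=(1, 1), binary=False):
        self.ngram_range = ngram_range
        self.binary = binary


class VarargsVectorizer(BaseVectorizer):
    """The pattern from my code above: parameters hidden behind varargs."""

    def __init__(self, *args, **kwargs):
        self.w2v_clusters = kwargs.pop("w2v_clusters", None)
        super().__init__(*args, **kwargs)


class ExplicitVectorizer(BaseVectorizer):
    """The alternative: repeat every parameter and pass it on by name."""

    def __init__(self, w2v_clusters=None, ngram_range=(1, 1), binary=False):
        super().__init__(ngram_range=ngram_range, binary=binary)
        self.w2v_clusters = w2v_clusters


if __name__ == "__main__":
    print(get_param_names(ExplicitVectorizer))
    try:
        get_param_names(VarargsVectorizer)
    except RuntimeError as err:
        print("RuntimeError:", err)
```

With the explicit signature the parameter names can be read off the
constructor, which I assume is what the printing code needs; with
*args/**kwargs there is nothing to read.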


Thanks for any pointers/guidance,
Fred.
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
