[Scikit-learn-general] Pickling custom Transformers in a Pipeline

Fred Mailhot Tue, 05 Apr 2016 13:16:43 -0700

Hi all,

I've got a pipeline with some custom transformers that's not pickling, and
I'm not sure why. I've had this previously when using custom preprocessors
& tokenizers with CountVectorizers. I dealt with it then by defining the
custom bits at the module level.
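
For reference, here is a minimal sketch of why the module-level workaround helps. It assumes Python 3 and the standard `pickle` module; `Holder`, `split_on_whitespace` and `can_pickle` are hypothetical names used only for illustration and are not part of scikit-learn. Pickle stores functions by qualified name, so a module-level function can be looked up again on load, whereas a lambda cannot.

```python
import pickle


def split_on_whitespace(text):
    """Module-level tokenizer: pickle stores it by its qualified name."""
    return text.split()


class Holder(object):
    """Hypothetical stand-in for an estimator that keeps a callable attribute."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer


def can_pickle(obj):
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A function defined at module level can be found again by name.
module_level_ok = can_pickle(Holder(split_on_whitespace))
# A lambda has no importable name, so pickling the holder fails.
lambda_ok = can_pickle(Holder(lambda text: text.split()))
```

The same reasoning would apply to nested functions and to callables created inside `__init__`.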


I assumed I could avoid that by writing custom transformers that directly
subclass TransformerMixin and importing them into the module where the
pipeline is defined.

The transformer is implemented like this:

==============================
[...imports...]
from text_preprocess import TextPreprocess

class CustomTransformer(TransformerMixin):

    def __init__(self, param_file="params.txt"):
        self.param_file = param_file
        # Helper object built from the parameter file.
        self.custom = TextPreprocess(self.param_file)

    def transform(self, X, *_):
        # Accept a single string as well as a list of strings (Python 2).
        if isinstance(X, basestring):
            X = [X]
        # Append the "rewrite" values of all matches to each document.
        return ["%s %s" % (x, " ".join([item["rewrite"] for item in
                                        self.custom.match(x)["info"]
                                        if "rewrite" in item]))
                for x in X]

    def fit(self, *_):
        return self
==============================

The full pipeline looks like this:

==============================
cm = CustomTransformer()

vec = FeatureUnion([("char_ng",
                     CountVectorizer(analyzer="char_wb", tokenizer=string.split,
                                     ngram_range=(3, 5), max_features=None,
                                     min_df=1, max_df=0.5, stop_words=None,
                                     binary=False)),
                    ("word_ng",
                     CountVectorizer(analyzer="word", ngram_range=(2, 3),
                                     max_features=5000, min_df=1, max_df=0.5,
                                     stop_words="english", binary=False))])

pipeline = Pipeline([("custom", cm), ("vec", vec),
                     ("lr", LogisticRegressionCV(scoring="f1_macro"))])
==============================

And I get the following error when I fit & dump:

==============================
In [62]: pipeline.fit(docs, [0, 0, 0, 1])
Out[62]:
Pipeline(steps=[('custom', <cm_transformer.CustomTransformer object at 0x113dd2310>),
  ('vec', FeatureUnion(n_jobs=1, transformer_list=[('char_ng',
  CountVectorizer(analyzer='char_wb', binary=False, decode_error=u'strict',
  ...None,
           refit=True, scoring='f1_macro', solver='lbfgs', tol=0.0001,
           verbose=0))])

In [63]: pickle.dump(pipeline, open("test_pl_dump.pkl", "wb"), pickle.HIGHEST_PROTOCOL)
---------------------------------------------------------------------------
PicklingError                             Traceback (most recent call last)
<ipython-input-63-99a63544716d> in <module>()
----> 1 pickle.dump(pipeline, open("test_pl_dump.pkl", "wb"), pickle.HIGHEST_PROTOCOL)

PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
==============================
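
The traceback does not say which object holds the offending function. One way to narrow it down might be to try pickling each instance attribute separately. The sketch below assumes Python 3 and uses only the standard library; `unpicklable_attributes` and `FakeStep` are hypothetical names, and `FakeStep` merely imitates a pipeline step that carries a lambda.

```python
import pickle


def unpicklable_attributes(obj):
    """Return the names of instance attributes of obj that pickle rejects."""
    failures = []
    for name, value in sorted(vars(obj).items()):
        try:
            pickle.dumps(value, pickle.HIGHEST_PROTOCOL)
        except Exception:
            # Any failure here means this attribute blocks pickling of obj.
            failures.append(name)
    return failures


class FakeStep(object):
    """Hypothetical stand-in for a pipeline step."""

    def __init__(self):
        self.param_file = "params.txt"
        self.callback = lambda x: x  # deliberately unpicklable


report = unpicklable_attributes(FakeStep())
```

With a real pipeline, the same helper could be called on each estimator in `pipeline.steps`, and again on any attribute it reports, until the function in question turns up.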

Any pointers would be appreciated. There are hints here and there on SO,
but most point to the solution I referred to above...
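
A different direction, sketched here under the assumption that the helper object stored on the transformer is what cannot be pickled (its internals are not shown in this post), would be to leave that helper out of the pickled state and rebuild it on load. `FakePreprocess` and `RebuildingTransformer` are hypothetical stand-ins written for Python 3, not the actual `TextPreprocess` or `CustomTransformer`.

```python
import pickle


class FakePreprocess(object):
    """Hypothetical stand-in for the helper; it holds an unpicklable lambda."""

    def __init__(self, param_file):
        self.param_file = param_file
        self.normalize = lambda s: s.lower()


class RebuildingTransformer(object):
    """Keeps only the file name in pickled state and rebuilds the helper."""

    def __init__(self, param_file="params.txt"):
        self.param_file = param_file
        self.custom = FakePreprocess(param_file)

    def __getstate__(self):
        # Copy the instance dict and drop the attribute pickle cannot handle.
        state = self.__dict__.copy()
        del state["custom"]
        return state

    def __setstate__(self, state):
        # Restore the plain attributes, then recreate the helper from them.
        self.__dict__.update(state)
        self.custom = FakePreprocess(self.param_file)


restored = pickle.loads(
    pickle.dumps(RebuildingTransformer(), pickle.HIGHEST_PROTOCOL))
```

This only works if the helper can be reconstructed from the stored parameters, for example if the parameter file is still available where the pickle is loaded.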

Thanks!
Fred.

