Re: [Scikit-learn-general] Problem with stacking text and binary features in FeatureUnion

Joel Nothman Sat, 30 Aug 2014 03:08:19 -0700

I cannot immediately tell why this doesn't work.

Firstly, I assume (and hope) it has nothing to do with transformer_weights.
Check that removing this still results in the error.


The error implies that the transformers (pipelines) are producing outputs
with different numbers of rows. To find out which one is responsible, you
could add a DebugTransformer to each pipeline:

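from sklearn.base import BaseEstimator, TransformerMixin

class DebugTransformer(TransformerMixin, BaseEstimator):
    """Pass-through step that prints the shape of whatever it receives."""
    def __init__(self, name):
        self.name = name

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Plain lists (e.g. the output of GetItemTransformer) have no .shape.
        shape = X.shape if hasattr(X, 'shape') else (len(X),)
        print(self.name, 'got', shape)
        return X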

and at least check the shapes directly.
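
As a rough sketch of what such a row mismatch can look like: one possible
cause (an assumption on my part, not something confirmed in this thread) is
that a flat list of flag values gets treated as a single sample with many
features once it is coerced to 2-D. The names text_block, flags, flag_row and
flag_column below are illustrative stand-ins, not taken from the original
code, and only numpy and scipy are used:

```python
import numpy as np
from scipy import sparse

n_samples = 3

# Stand-in for a HashingVectorizer output: one row per sample.
text_block = sparse.csr_matrix(np.ones((n_samples, 4)))

# Stand-in for what GetItemTransformer('phone_flag') returns: a flat list.
flags = [1, 0, 1]

# Coerced to 2-D, a flat list becomes ONE row with n_samples columns.
flag_row = sparse.csr_matrix(np.atleast_2d(flags))
print('text', text_block.shape, 'flag', flag_row.shape)

try:
    sparse.hstack([text_block, flag_row])
    stacking_failed = False
except ValueError as exc:
    # Row counts differ (3 vs 1), so the blocks cannot be placed side by side.
    stacking_failed = True
    print('hstack failed:', exc)

# Reshaping to a column gives one row per sample, so stacking works.
flag_column = sparse.csr_matrix(np.asarray(flags).reshape(-1, 1))
stacked = sparse.hstack([text_block, flag_column]).tocsr()
print('stacked', stacked.shape)
```

If the printed shapes in the real pipeline show a similar pattern, reshaping
the flag values into a column before they reach Binarizer may be worth trying.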

- Joel



On 30 August 2014 12:48, Lakomkin Egor <[email protected]> wrote:

> Hi all,
>
> I have heterogeneous data with text and binary features, and I am trying
> to handle it with FeatureUnion. I use HashingVectorizer for the text data
> and Binarizer for the integer data (I only need to know whether the value
> of the feature is greater than 0).
>
> The problem is that the naive code I wrote did not work out of the box.
> Is there an example of combining text and binary data in FeatureUnion?
>
> I have included the error and the FeatureUnion code that I tried below.
> Thanks in advance for any help!
>
> Platform: Windows 7, 64-bit, scikit-learn : 0.15.1
> The error:
> X_batch = transformer.transform(X_batch)
>   File "C:\Anaconda\lib\site-packages\sklearn\pipeline.py", line 384, in transform
>     Xs = sparse.hstack(Xs).tocsr()
>   File "C:\Anaconda\lib\site-packages\scipy\sparse\construct.py", line 453, in hstack
>     return bmat([blocks], format=format, dtype=dtype)
>   File "C:\Anaconda\lib\site-packages\scipy\sparse\construct.py", line 567, in bmat
>     raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
> ValueError: blocks[0,:] has incompatible row dimensions
>
> The data that I feed to the transformer (in batches) has the form
> [ {'title' : ..., 'description' : '', 'phone_flag' : 1}, .. ]
>
> FeatureUnion structure that I use:
>
> transformer = FeatureUnion([
>     ('description', Pipeline([
>         ('get', GetItemTransformer('description')),
>         ('vectorize', HashingVectorizer(encoding='utf-8',
>                                         n_features=N_TEXT_FEATURES,
>                                         analyzer=analyzer)),
>     ])),
>     ('title', Pipeline([
>         ('get', GetItemTransformer('title')),
>         ('vectorize', HashingVectorizer(encoding='utf-8',
>                                         n_features=N_TEXT_FEATURES,
>                                         analyzer=analyzer)),
>     ])),
>     ('flag', Pipeline([
>         ('get', GetItemTransformer('phone_flag')),
>         ('vectorize', Binarizer()),
>     ])),
> ], transformer_weights={'title': 2.0, 'description': 1.0})
>
>
> GetItemTransformer
>
> class GetItemTransformer(TransformerMixin):
>     def __init__(self, field):
>         self.field = field
>
>     def transform(self,X):
>         if type(X) == type([]):
>             return [x[self.field] for x in X]
>         raise Exception("Not supported")
>
>     def fit(self,X,Y=None, **fit_params):
>         return self
>
> Regards, Egor
>
>
> ------------------------------------------------------------------------------
> Slashdot TV.
> Video for Nerds.  Stuff that matters.
> http://tv.slashdot.org/
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>

