Hi all,

I have heterogeneous data with text and binary features, and I am trying to
handle it with FeatureUnion. I use HashingVectorizer for the text data and
Binarizer for the integer data (I only need to know whether the feature value is > 0).

The problem is that the naive code I wrote did not work out of the box. Is
there an example of combining text and binary data in FeatureUnion?

I have included the error description and the code/structure of the
FeatureUnion that I tried below. Thanks in advance for any help!

Platform: Windows 7, 64-bit, scikit-learn 0.15.1

The error:
X_batch = transformer.transform(X_batch)
  File "C:\Anaconda\lib\site-packages\sklearn\pipeline.py", line 384, in transform
    Xs = sparse.hstack(Xs).tocsr()
  File "C:\Anaconda\lib\site-packages\scipy\sparse\construct.py", line 453, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\Anaconda\lib\site-packages\scipy\sparse\construct.py", line 567, in bmat
    raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions

The data that I feed to the transformer (in batches) is a list of dicts of the
form [{'title': ..., 'description': '', 'phone_flag': 1}, ...]
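
To make the shapes concrete, here is a minimal sketch of one such batch and of
what a per-field lookup returns for it. The sample values and the helper name
get_field are invented for illustration; only the three keys come from the data
format described above.

```python
# Illustrative batch: the values are made up, only the keys match the real data.
X_batch = [
    {'title': 'Red bike', 'description': 'Almost new, call me', 'phone_flag': 1},
    {'title': 'Old sofa', 'description': '', 'phone_flag': 0},
    {'title': 'Laptop', 'description': 'Works fine', 'phone_flag': 3},
]


def get_field(batch, field):
    """Hypothetical helper: pull one field out of every record in the batch."""
    return [record[field] for record in batch]


titles = get_field(X_batch, 'title')      # list of strings, one per sample
flags = get_field(X_batch, 'phone_flag')  # flat list of ints, one per sample

print(titles)
print(flags)
```

Note that the text fields come out as a list of strings, while the flag comes
out as a flat list of integers with no second dimension.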

FeatureUnion structure that I use:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import Binarizer

# N_TEXT_FEATURES and analyzer are defined elsewhere (not shown here).
transformer = FeatureUnion([
        ('description', Pipeline([
                ('get', GetItemTransformer('description')),
                ('vectorize', HashingVectorizer(encoding='utf-8',
                        n_features=N_TEXT_FEATURES, analyzer=analyzer)),
            ])
        ),
        ('title', Pipeline([
                ('get', GetItemTransformer('title')),
                ('vectorize', HashingVectorizer(encoding='utf-8',
                        n_features=N_TEXT_FEATURES, analyzer=analyzer)),
            ])
        ),
        ('flag', Pipeline([
                ('get', GetItemTransformer('phone_flag')),
                ('vectorize', Binarizer()),
            ])
        ),
    ], transformer_weights={'title': 2.0, 'description': 1.0})


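GetItemTransformer:

from sklearn.base import TransformerMixin

class GetItemTransformer(TransformerMixin):
    """Extract a single field from each dict in a list of records."""

    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None, **fit_params):
        # Stateless: there is nothing to learn from the data.
        return self

    def transform(self, X):
        # Only plain lists of dicts are supported.
        if isinstance(X, list):
            return [x[self.field] for x in X]
        raise TypeError("Not supported")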
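
A possible cause, stated as an assumption rather than a confirmed diagnosis: the
'flag' branch hands Binarizer a flat list of integers, which may end up being
treated as a single row of n values instead of n rows of one value, and that
would not line up with the text branches in hstack. Below is a minimal sketch of
a variant that returns one single-element row per sample. ColumnItemTransformer
is a hypothetical name, not part of scikit-learn, and the sketch is plain Python
so that it runs without any library.

```python
class ColumnItemTransformer(object):
    """Hypothetical variant of GetItemTransformer for numeric fields.

    Returns a list of one-element rows, i.e. shape (n_samples, 1),
    instead of a flat list of length n_samples.
    """

    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None, **fit_params):
        # Stateless: nothing to learn.
        return self

    def transform(self, X):
        if not isinstance(X, list):
            raise TypeError("Not supported")
        # Wrap each value in its own list so every sample becomes one row.
        return [[x[self.field]] for x in X]


batch = [{'phone_flag': 1}, {'phone_flag': 0}, {'phone_flag': 3}]  # made-up values
rows = ColumnItemTransformer('phone_flag').fit(batch).transform(batch)
print(rows)
```

In the FeatureUnion shown earlier, this would stand in for
GetItemTransformer('phone_flag') in the 'flag' pipeline. Whether that alone is
enough for Binarizer in 0.15.1 has not been checked here.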

Regards, Egor
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
