Hi all,
I have heterogeneous data with text and binary features, and I am trying to
handle it with FeatureUnion. I use HashingVectorizer for the text data and
Binarizer for the integer data (I only need to know whether the feature value
is > 0).
The problem is that the naive code I wrote does not work out of the box. Is
there an example of combining text and binary data in FeatureUnion?
The error and the code/structure of the FeatureUnion that I tried are included
below. Thanks in advance for any help!
Platform: Windows 7, 64-bit, scikit-learn 0.15.1
The error:
    X_batch = transformer.transform(X_batch)
  File "C:\Anaconda\lib\site-packages\sklearn\pipeline.py", line 384, in transform
    Xs = sparse.hstack(Xs).tocsr()
  File "C:\Anaconda\lib\site-packages\scipy\sparse\construct.py", line 453, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\Anaconda\lib\site-packages\scipy\sparse\construct.py", line 567, in bmat
    raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions
The data that I feed to the transformer (in batches) is a list of dicts of the
form:
    [{'title': ..., 'description': '', 'phone_flag': 1}, ..]
The FeatureUnion structure that I use:
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.preprocessing import Binarizer

    # N_TEXT_FEATURES and analyzer are defined elsewhere in my code.
    transformer = FeatureUnion([
        ('description', Pipeline([
            ('get', GetItemTransformer('description')),
            ('vectorize', HashingVectorizer(encoding='utf-8',
                                            n_features=N_TEXT_FEATURES,
                                            analyzer=analyzer)),
        ])),
        ('title', Pipeline([
            ('get', GetItemTransformer('title')),
            ('vectorize', HashingVectorizer(encoding='utf-8',
                                            n_features=N_TEXT_FEATURES,
                                            analyzer=analyzer)),
        ])),
        ('flag', Pipeline([
            ('get', GetItemTransformer('phone_flag')),
            ('vectorize', Binarizer()),
        ])),
    ], transformer_weights={'title': 2.0, 'description': 1.0})
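GetItemTransformer:
    from sklearn.base import TransformerMixin

    class GetItemTransformer(TransformerMixin):
        """Pick a single field out of every dict in a batch."""

        def __init__(self, field):
            self.field = field

        def transform(self, X):
            # Only plain lists of dicts are supported.
            if isinstance(X, list):
                return [x[self.field] for x in X]
            raise Exception("Not supported")

        def fit(self, X, Y=None, **fit_params):
            # Stateless: nothing to learn.
            return self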
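
To make the shapes concrete, here is a minimal plain-Python sketch of what the
extraction step hands to each branch for a toy batch. The sample values and the
as_column helper are invented for illustration and are not part of the code
above. It only shows that the text fields come out as one string per sample,
while 'phone_flag' comes out as a flat list of ints with no explicit row axis.
One possible reading, not confirmed here, is that an array-based transformer
such as Binarizer treats such a flat list as a single row, which would be
consistent with the "incompatible row dimensions" error.

```python
# Toy batch in the same shape as the real data (values are invented).
batch = [
    {'title': 'Red bike', 'description': 'Almost new', 'phone_flag': 1},
    {'title': 'Old sofa', 'description': '', 'phone_flag': 0},
    {'title': 'Laptop', 'description': 'Works fine', 'phone_flag': 3},
]


def get_item(X, field):
    """Same extraction as GetItemTransformer.transform."""
    return [x[field] for x in X]


def as_column(values):
    """Hypothetical helper: wrap each scalar so there is one row per sample."""
    return [[v] for v in values]


titles = get_item(batch, 'title')      # one string per sample
flags = get_item(batch, 'phone_flag')  # flat list of ints, no row axis
flag_column = as_column(flags)         # one single-element row per sample

print(flags)
print(flag_column)
```

With the column form, the flag branch has as many rows as the text branches,
which is what a horizontal stack of the per-branch outputs requires.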
Regards, Egor