Joel,

Thank you for your reply. I fixed the problem by defining my own
transformer, which does the same job as Binarizer but produces a sparse
matrix.
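
The replacement transformer itself is not shown in this message. A minimal
sketch of the idea, assuming the flags arrive as a flat list with one value
per sample, might look like the following. The class name SparseBinarizer and
its threshold parameter are hypothetical; they are not part of scikit-learn
or of the original code.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin


class SparseBinarizer(BaseEstimator, TransformerMixin):
    """Hypothetical helper: 1.0 where a value exceeds threshold, else 0.0."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        # Force one row per sample: a flat list of n flags becomes shape (n, 1).
        column = np.asarray(X, dtype=float).reshape(-1, 1)
        # Return a sparse matrix so it stacks with the text features.
        return sparse.csr_matrix((column > self.threshold).astype(float))


flags = SparseBinarizer().fit_transform([1, 0, 3])
print(flags.shape)
```

The sketch always returns one row per sample, so its output should stack
row-wise with the output of the text pipelines.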

Regards, Egor


2014-08-30 18:07 GMT+08:00 Joel Nothman <[email protected]>:

> I cannot immediately tell why this doesn't work.
>
> Firstly, I assume (and hope) it has nothing to do with
> transformer_weights. Check that removing this still results in the error.
>
> The error implies that the transformers (pipelines) are producing data with
> different numbers of rows. To check this, you could add a DebugTransformer
> like the following to each pipeline:
>
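> from sklearn.base import TransformerMixin
>
> class DebugTransformer(TransformerMixin):
>     def __init__(self, name):
>         self.name = name
>
>     def transform(self, X):
>         # Report the shape seen at this point, then pass the data through.
>         print(self.name, 'got', X.shape)
>         return X
>
>     def fit(self, X, y=None):
>         return self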
>
> and at least check the shapes directly.
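
To illustrate the kind of mismatch such a shape check would reveal, here is a
small sketch using only NumPy and SciPy. It assumes, without confirmation
from the thread, that the flat list of flags ends up as a single row rather
than one row per sample. The variable names and the 8-column stand-in for the
hashed text features are made up for the example.

```python
import numpy as np
from scipy import sparse

n_samples = 3
# Stand-in for the hashed text features: one row per sample.
text_block = sparse.csr_matrix(np.ones((n_samples, 8)))
flags = [1, 0, 1]

# Assumption: the flat list is promoted to 2-D as one row, shape (1, 3).
flags_as_row = sparse.csr_matrix(np.atleast_2d(flags))
# One row per sample instead, shape (3, 1).
flags_as_column = sparse.csr_matrix(np.reshape(flags, (-1, 1)))

try:
    sparse.hstack([text_block, flags_as_row])
    stacking_failed = False
except ValueError:
    # Row counts differ (3 vs 1), so horizontal stacking is rejected.
    stacking_failed = True

# With matching row counts the blocks stack side by side.
stacked = sparse.hstack([text_block, flags_as_column]).tocsr()
print(stacking_failed, stacked.shape)
```

Horizontal stacking only requires that every block has the same number of
rows, so reshaping the flags into a column is enough for the second call to
succeed.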
>
> - Joel
>
>
>
> On 30 August 2014 12:48, Lakomkin Egor <[email protected]> wrote:
>
>> Hi all,
>>
>> I have heterogeneous data with text and binary features that I am trying
>> to handle with FeatureUnion. I use HashingVectorizer for the text data and
>> Binarizer for the integer data (I only need to know whether the value of
>> the feature is > 0).
>>
>> The problem is that the naive code I wrote does not work out of the box.
>> Is there an example of using text and binary data together in
>> FeatureUnion?
>>
>> I have included the error description and the code/structure of the
>> FeatureUnion that I tried below. Thanks in advance for your help!
>>
>> Platform: Windows 7, 64-bit, scikit-learn: 0.15.1
>> The error:
>> X_batch = transformer.transform(X_batch)
>>   File "C:\Anaconda\lib\site-packages\sklearn\pipeline.py", line 384, in transform
>>     Xs = sparse.hstack(Xs).tocsr()
>>   File "C:\Anaconda\lib\site-packages\scipy\sparse\construct.py", line 453, in hstack
>>     return bmat([blocks], format=format, dtype=dtype)
>>   File "C:\Anaconda\lib\site-packages\scipy\sparse\construct.py", line 567, in bmat
>>     raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
>> ValueError: blocks[0,:] has incompatible row dimensions
>>
>> The data that I feed to the transformer (in batches) has the form
>> [{'title': ..., 'description': '', 'phone_flag': 1}, ..]
>>
>> The FeatureUnion structure that I use:
>>
>> transformer = FeatureUnion([
>>         ('description', Pipeline([
>>                 ('get', GetItemTransformer('description')),
>>                 ('vectorize', HashingVectorizer(encoding='utf-8',
>>                     n_features=N_TEXT_FEATURES, analyzer=analyzer)),
>>             ])
>>         ),
>>         ('title', Pipeline([
>>                 ('get', GetItemTransformer('title')),
>>                 ('vectorize', HashingVectorizer(encoding='utf-8',
>>                     n_features=N_TEXT_FEATURES, analyzer=analyzer)),
>>             ])
>>         ),
>>         ('flag',
>>             Pipeline([
>>                 ('get', GetItemTransformer('phone_flag')),
>>                 ('vectorize', Binarizer()),
>>             ])
>>         ),
>>     ], transformer_weights={'title': 2.0, 'description': 1.0})
>>
>>
>> GetItemTransformer:
>>
>> class GetItemTransformer(TransformerMixin):
>>     def __init__(self, field):
>>         self.field = field
>>
>>     def transform(self, X):
>>         # Pull one field out of each record in a list of dicts.
>>         if isinstance(X, list):
>>             return [x[self.field] for x in X]
>>         raise Exception("Not supported")
>>
>>     def fit(self, X, y=None, **fit_params):
>>         return self
>>
>> Regards, Egor
>>
>>
>> ------------------------------------------------------------------------------
>> Slashdot TV.
>> Video for Nerds.  Stuff that matters.
>> http://tv.slashdot.org/
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
>