That's not a solution I'm happy with :s
On 30 August 2014 21:35, Lakomkin Egor <[email protected]> wrote:
> Joel,
>
> Thank you for your reply. I fixed the problem with defining my own
> transformer, that does the same function as Binarizer, but produces sparse
> matrix.
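
For anyone finding this thread later, a workaround of that kind might look roughly like the sketch below. It is not Egor's actual code, which was not posted. The class name SparseBinarizer and its threshold parameter are hypothetical, and the sketch assumes the input is a flat list of numbers, one per sample. It only needs numpy and scipy, and exposes the usual fit/transform/fit_transform methods.

```python
import numpy as np
from scipy import sparse


class SparseBinarizer:
    """Hypothetical stand-in for the custom transformer described above.

    Maps a flat sequence of numbers to an (n_samples, 1) sparse column
    holding 1.0 where the value exceeds `threshold` and 0.0 elsewhere.
    """

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        # Assumes X is flat; reshape(-1, 1) gives each value its own row.
        column = np.asarray(X, dtype=float).reshape(-1, 1)
        return sparse.csr_matrix((column > self.threshold).astype(float))

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


flags = SparseBinarizer().fit_transform([3, 0, 1])
print(flags.shape)  # (3, 1)
```

The reshape is the important part: the result has one row per sample, so it can be stacked next to the vectorizer outputs.
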
>
> Regards, Egor
>
>
> 2014-08-30 18:07 GMT+08:00 Joel Nothman <[email protected]>:
>
>> I cannot immediately tell why this doesn't work.
>>
>> Firstly, I assume (and hope) it has nothing to do with
>> transformer_weights. Check that removing this still results in the error.
>>
>> The error implies that the transformers (pipelines) are producing data of
>> different shapes. It may help to add a DebugTransformer like the one below
>> to each pipeline:
>>
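>> from sklearn.base import TransformerMixin
>>
>> class DebugTransformer(TransformerMixin):
>>     def __init__(self, name):
>>         self.name = name
>>
>>     def transform(self, X):
>>         # Report the shape this step receives, then pass X through unchanged.
>>         print(self.name, 'got', X.shape)
>>         return X
>>
>>     def fit(self, X, y=None):
>>         return self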
>>
>> and at least check the shapes directly.
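
One possibility worth ruling out (an assumption, not confirmed anywhere in this thread): the 'flag' pipeline hands on a flat list of integers, and scipy treats a 1-D sequence as a single row when it converts it to a sparse matrix. The check below uses made-up stand-in data (the names flags and text are hypothetical) to show the row mismatch, and that a column reshape avoids it.

```python
import numpy as np
from scipy import sparse

flags = [1, 0, 1]                 # one flag value per sample
text = sparse.csr_matrix((3, 8))  # stand-in for vectorizer output: 3 samples, 8 features

# A flat sequence converted to a sparse matrix becomes a single row.
as_row = sparse.coo_matrix(np.asarray(flags))
print(as_row.shape)   # (1, 3)

# Reshaped to one column, the row counts agree and stacking succeeds.
as_column = sparse.csr_matrix(np.asarray(flags).reshape(-1, 1))
stacked = sparse.hstack([text, as_column]).tocsr()
print(stacked.shape)  # (3, 9)
```

If the shapes printed by the DebugTransformer show 1 row for the flag pipeline and N rows for the text pipelines, this would be the likely culprit.
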
>>
>> - Joel
>>
>>
>>
>> On 30 August 2014 12:48, Lakomkin Egor <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I have heterogeneous data with text and binary features, and I am trying
>>> to handle it with FeatureUnion. I use HashingVectorizer for the text data
>>> and Binarizer for the integer data (I only need to know whether the value
>>> of a feature is greater than 0).
>>>
>>> The problem is that the naive code I wrote did not work out of the box.
>>> Is there an example of using text and binary data together in
>>> FeatureUnion?
>>>
>>> I have included the error description and the code/structure of the
>>> FeatureUnion that I tried below. Thanks in advance for your help!
>>>
>>> Platform: Windows 7, 64-bit, scikit-learn: 0.15.1
>>>
>>> The error:
>>>
>>>     X_batch = transformer.transform(X_batch)
>>>   File "C:\Anaconda\lib\site-packages\sklearn\pipeline.py", line 384, in transform
>>>     Xs = sparse.hstack(Xs).tocsr()
>>>   File "C:\Anaconda\lib\site-packages\scipy\sparse\construct.py", line 453, in hstack
>>>     return bmat([blocks], format=format, dtype=dtype)
>>>   File "C:\Anaconda\lib\site-packages\scipy\sparse\construct.py", line 567, in bmat
>>>     raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
>>> ValueError: blocks[0,:] has incompatible row dimensions
>>>
>>> The data that I feed to the transformer (in batches) has the form
>>> [{'title': ..., 'description': '', 'phone_flag': 1}, ...]
>>>
>>> FeatureUnion structure that I use:
>>>
>>> transformer = FeatureUnion([
>>>     ('description', Pipeline([
>>>         ('get', GetItemTransformer('description')),
>>>         ('vectorize', HashingVectorizer(encoding='utf-8',
>>>             n_features=N_TEXT_FEATURES, analyzer=analyzer)),
>>>     ])),
>>>     ('title', Pipeline([
>>>         ('get', GetItemTransformer('title')),
>>>         ('vectorize', HashingVectorizer(encoding='utf-8',
>>>             n_features=N_TEXT_FEATURES, analyzer=analyzer)),
>>>     ])),
>>>     ('flag', Pipeline([
>>>         ('get', GetItemTransformer('phone_flag')),
>>>         ('vectorize', Binarizer()),
>>>     ])),
>>> ], transformer_weights={'title': 2.0, 'description': 1.0})
>>>
>>>
>>> GetItemTransformer:
>>>
>>> class GetItemTransformer(TransformerMixin):
>>>     def __init__(self, field):
>>>         self.field = field
>>>
>>>     def transform(self, X):
>>>         # Pull one field out of each record in a list of dicts.
>>>         if isinstance(X, list):
>>>             return [x[self.field] for x in X]
>>>         raise Exception("Not supported")
>>>
>>>     def fit(self, X, y=None, **fit_params):
>>>         return self
>>>
>>> Regards, Egor
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Slashdot TV.
>>> Video for Nerds. Stuff that matters.
>>> http://tv.slashdot.org/
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general