Re: [Scikit-learn-general] Strings as features

Abijith Kp Sun, 29 Jun 2014 23:55:29 -0700

In which version of sklearn are the above-mentioned 'make_pipeline' and
'make_union' defined?


After reading through some examples, I think the idea behind FeatureUnion
and Pipeline is easy to follow. The former concatenates the features
produced by each of the transformers it is given, whereas the latter applies
its estimators in sequence, each one operating on the output of the
previous one.
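
To make that difference concrete, here is a minimal sketch, assuming a
scikit-learn release that ships `make_pipeline` and `make_union`. The
documents are made up for illustration, and the choice of a word-level and a
character-level CountVectorizer in the union is only an example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline, make_union

# Made-up documents, for illustration only.
docs = ["random1 text", "random2 text", "more random text"]

# Pipeline: steps run in sequence, each consuming the previous step's output.
chained = make_pipeline(CountVectorizer(), TfidfTransformer())
X_chained = chained.fit_transform(docs)

# FeatureUnion: every transformer sees the same input, and the outputs
# are concatenated column-wise.
union = make_union(CountVectorizer(analyzer="word"),
                   CountVectorizer(analyzer="char"))
X_union = union.fit_transform(docs)

# Width of the union equals the sum of the two vocabularies.
n_word = len(union.transformer_list[0][1].vocabulary_)
n_char = len(union.transformer_list[1][1].vocabulary_)
print(X_chained.shape, X_union.shape, n_word + n_char)
```

The pipeline output has one column per word, while the union output has one
column per word plus one per character.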



On Mon, Jun 23, 2014 at 1:06 AM, Joel Nothman <[email protected]>
wrote:

> Actually, it is a little easier with `make_pipeline` and `make_union`,
> which weren't around at the time. I think it's still a little more
> abstract than most people who come across this problem would be
> comfortable implementing.
>
> Still, it needs an example.
>
>
> On 22 June 2014 15:31, Andy <[email protected]> wrote:
>
>>  Yeah that is exactly what I was thinking about.
>> Though I would disagree that it is not simple to write and lengthy ;)
>>
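>> from sklearn.base import BaseEstimator, TransformerMixin
>> from sklearn.feature_extraction.text import TfidfVectorizer
>> from sklearn.pipeline import FeatureUnion, Pipeline
>>
>> class GetItemTransformer(TransformerMixin, BaseEstimator):
>>     """Select one field from each record (a dict) in X."""
>>     def __init__(self, field):
>>         self.field = field
>>     def fit(self, X, y=None):
>>         return self  # stateless, nothing to learn
>>     def transform(self, X):
>>         return [record[self.field] for record in X]
>>
>> # features: list of the dict keys that hold text
>> transformer = FeatureUnion([
>>     (feat, Pipeline([
>>         ('get', GetItemTransformer(feat)),
>>         # TfidfVectorizer rather than TfidfTransformer: input is raw strings
>>         ('transform', TfidfVectorizer())
>>     ]))
>>     for feat in features
>> ])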
>>
>> Doesn't really seem so bad.
>> I agree it could probably be improved, but it could be worse ;)
>>
>> (That code above does completely solve the problem right?)
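
As a rough check of the approach sketched above, the following self-contained
example runs a per-field union on a list of dicts. The first record is the one
given earlier in the thread; the other two records, and the name `text_fields`,
are made up for illustration. TfidfVectorizer is used here on the assumption
that the fields hold raw strings, and the always-empty field "c" is left out
because a vectorizer cannot learn a vocabulary from empty documents.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class GetItemTransformer(TransformerMixin, BaseEstimator):
    """Pick one key out of every dict in X."""

    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        return [record[self.field] for record in X]


# First record is from the thread; the other two are made up.
records = [
    {"a": "3", "b": "random1", "c": "", "d": "random2 text"},
    {"a": "7", "b": "random1", "c": "", "d": "some description"},
    {"a": "1", "b": "random3", "c": "", "d": "more random2 text"},
]
text_fields = ["b", "d"]  # hypothetical choice; "c" is always empty here

# One (selector -> tf-idf) pipeline per field, concatenated column-wise.
transformer = FeatureUnion([
    (field, Pipeline([
        ("get", GetItemTransformer(field)),
        ("tfidf", TfidfVectorizer()),
    ]))
    for field in text_fields
])
X = transformer.fit_transform(records)

# Each field keeps its own vocabulary.
vocab_b = transformer.transformer_list[0][1].named_steps["tfidf"].vocabulary_
vocab_d = transformer.transformer_list[1][1].named_steps["tfidf"].vocabulary_
print(X.shape, sorted(vocab_b), sorted(vocab_d))
```

The numeric field "a" is not handled here; it would need its own branch in the
union.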
>>
>>
>>
>> On 06/22/2014 06:54 AM, Joel Nothman wrote:
>>
>> It is possible to do what you want, but it is not simple to write.
>> Scikit-learn could definitely benefit from an example showing this sort of
>> thing, or from a better API to help the user do it, as suggested at
>> https://github.com/scikit-learn/scikit-learn/issues/2034. There you will
>> find a lengthy comment where I give an example very similar to yours (but
>> with fields as attributes rather than dict keys).
>>
>>
>> On 21 June 2014 09:10, Abijith Kp <[email protected]> wrote:
>>
>>> What would be the advantage of using a shared vocabulary for
>>> CountVectorizer?
>>>
>>> When I read about FeatureUnion, I understood that each transformer in
>>> the given list processes the whole data set. Could we use it to
>>> process different features selectively? Or have I misunderstood the
>>> concept?
>>>
>>>  Regards,
>>>  Abijith
>>>
>>>
>>> On Sat, Jun 21, 2014 at 7:12 PM, Andy <[email protected]> wrote:
>>>
>>>>  Yes, you can use CountVectorizer.
>>>> Do you want the different features to share the same vocabulary?
>>>> To use the Count Vectorizer, you probably have to either get all the
>>>> values (for a shared vocabulary)
>>>> or learn one CountVectorizer per key (you could use FeatureUnion for
>>>> that).
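
For the shared-vocabulary option, one possible sketch is to fit a single
CountVectorizer on the values of every field and then encode each field with
it. The records and the name `fields` are made up for illustration, and
stacking the per-field blocks with `scipy.sparse.hstack` is an assumption
about how the result would be used, not something the thread prescribes.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Made-up records with two text fields, for illustration only.
records = [
    {"b": "random1", "d": "random2 text"},
    {"b": "random2", "d": "some random1 text"},
]
fields = ["b", "d"]

# Learn a single vocabulary from the values of every field.
shared = CountVectorizer()
shared.fit([r[f] for r in records for f in fields])

# Encode each field with that vocabulary, one column block per field.
blocks = [shared.transform([r[f] for r in records]) for f in fields]
X = hstack(blocks).tocsr()

n_vocab = len(shared.vocabulary_)
idx = shared.vocabulary_["random1"]
print(sorted(shared.vocabulary_), X.shape)
```

With a shared vocabulary, column `idx` in the first block and column
`n_vocab + idx` in the second block refer to the same word, which makes the
fields directly comparable.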
>>>>
>>>> So there is a little bit of code to write to handle the fact that you
>>>> have multiple text fields.
>>>>
>>>> Hth,
>>>> Andy
>>>>
>>>>
>>>>
>>>> On 06/21/2014 03:35 PM, Abijith Kp wrote:
>>>>
>>>>  Hi,
>>>>
>>>>
>>>> One entry in my feature list initially looks like: {"a":"3",
>>>> "b":"random1", "c":"", "d":"random2 text"}.
>>>> The random text contains names of people, email IDs, some
>>>> description, numbers, and so on.
>>>>
>>>> When I used DictVectorizer, I could not get an accurate clustering.
>>>>
>>>> I wanted to know if there is any method similar to DictVectorizer
>>>> that can correctly process a dictionary of string features.
>>>>
>>>>  Regards,
>>>> Abijith
>>>>
>>>>
>>>> On Sat, Jun 21, 2014 at 6:51 PM, Andy <[email protected]> wrote:
>>>>
>>>>>  Hi Abijith.
>>>>>
>>>>> It depends on how you want to interpret the strings.
>>>>> If they are texts and you want to interpret them based on their
>>>>> content, Brian's suggestion is the right one.
>>>>> If you want to consider each possible string as a distinct feature,
>>>>> the OneHotEncoder would be the right choice.
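
The two interpretations can be compared on a small made-up example. This
sketch uses DictVectorizer for the "each string is a distinct value" case,
since it turns string values into one indicator column each; that is a
substitution for illustration, not the OneHotEncoder mentioned above.
CountVectorizer stands in for the content-based interpretation.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Made-up values for a single string field "d".
values = ["random2 text", "random2 words", "other text"]

# Categorical view: each distinct string becomes its own indicator column.
one_hot = DictVectorizer()
X_cat = one_hot.fit_transform([{"d": v} for v in values])
print(one_hot.feature_names_)

# Text view: each string is split into word tokens, so strings that
# share words also share columns.
bow = CountVectorizer()
X_text = bow.fit_transform(values)
print(sorted(bow.vocabulary_))
```

In the categorical view no two different strings share a column, so free-text
fields end up equally distant from one another, which may explain poor
clustering results on such data.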
>>>>>
>>>>> Could you give an example of what the strings and the semantics of the
>>>>> strings are?
>>>>>
>>>>> Andy
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 06/20/2014 06:05 PM, Abijith Kp wrote:
>>>>>
>>>>> Can anyone help me with the problem of dealing with features that
>>>>> are a mix of numbers and strings of varying length (say from 0 to
>>>>> 100-150 characters)?
>>>>>
>>>>> What are the most widely used techniques in this kind of
>>>>> situation? And can it be solved using only scikit-learn?
>>>>>
>>>>> PS: Initially I have to convert a JSON file to a list of features,
>>>>> and then use it.
>>>>>
>>>>>  Any help is appreciated.
>>>>>
>>>>>  Regards,
>>>>> Abijith
>>>>>
>>>>> --
>>>>>  Abijith KP
>>>>> github.com/abijith-kp
>>>>> kpabijith.wordpress.com
>>>>>
>>>>>
>>>>>   
>>>>> ------------------------------------------------------------------------------
>>>>> HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions
>>>>> Find What Matters Most in Your Big Data with HPCC Systems
>>>>> Open Source. Fast. Scalable. Simple. Ideal for Dirty Data.
>>>>> Leverages Graph Analysis for Fast Processing & Easy Data Exploration
>>>>> http://p.sf.net/sfu/hpccsystems
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> [email protected]
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>


-- 
Abijith KP
github.com/abijith-kp
kpabijith.wordpress.com

------------------------------------------------------------------------------
Open source business process management suite built on Java and Eclipse
Turn processes into business applications with Bonita BPM Community Edition
Quickly connect people, data, and systems into organized workflows
Winner of BOSSIE, CODIE, OW2 and Gartner awards
http://p.sf.net/sfu/Bonitasoft

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
