Re: [Scikit-learn-general] Strings as features

Andy Sun, 22 Jun 2014 12:33:07 -0700

Yeah that is exactly what I was thinking about.
Though I would disagree that it is hard to write, or lengthy ;)


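from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class GetItemTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        # nothing to learn; TransformerMixin does not supply fit() itself
        return self

    def transform(self, X):
        # X is a sequence of dicts; pull out one field per sample
        return [x[self.field] for x in X]


# features: the list of dict keys that hold text
transformer = FeatureUnion([
    (feat, Pipeline([
        ('get', GetItemTransformer(feat)),
        # TfidfVectorizer, not TfidfTransformer: the input is raw strings
        ('transform', TfidfVectorizer())
    ]))
    for feat in features
])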

Doesn't really seem so bad.
I agree it could probably be improved, but it could be worse ;)

(The code above does completely solve the problem, right?)
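
To check the idea end to end, here is a minimal, self-contained sketch of the same per-field pipeline run on a couple of made-up records. The sample data and the field names "b" and "d" are assumptions for illustration, and TfidfVectorizer is used so that raw strings can be fed in directly.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class GetItemTransformer(BaseEstimator, TransformerMixin):
    """Select one field from each record (each record is assumed to be a dict)."""

    def __init__(self, field):
        self.field = field

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return [record[self.field] for record in X]


# Hypothetical sample records; the keys "b" and "d" are illustrative only.
records = [
    {"b": "alice smith", "d": "reported a login problem"},
    {"b": "bob jones", "d": "asked about a login refund"},
]
features = ["b", "d"]

# One TF-IDF vocabulary per field, with the results stacked side by side.
union = FeatureUnion([
    (feat, Pipeline([
        ("get", GetItemTransformer(feat)),
        ("tfidf", TfidfVectorizer()),
    ]))
    for feat in features
])

X = union.fit_transform(records)
print(X.shape)  # (number of records, total number of terms over all fields)
```

Each field gets its own vocabulary here, so the word "login" in field "d" would never be confused with the same word appearing in another field.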


On 06/22/2014 06:54 AM, Joel Nothman wrote:

It is possible to do what you want, but it is not simple to write.
Scikit-learn could definitely benefit from an example showing this
sort of thing, or from a better API to help the user do it, as
suggested at https://github.com/scikit-learn/scikit-learn/issues/2034.
There you will find a lengthy comment where I give an example very
similar to yours (but with fields as attributes rather than dict keys).

On 21 June 2014 09:10, Abijith Kp wrote:


    What would be the advantage of using a shared vocabulary for
    CountVectorizer?

    When I read about FeatureUnion, what I understood was that each of
    the given transformers processes the complete data set. Could we
    use it to selectively process different features? Or is my
    understanding of the concept not clear?

    Regards,
    Abijith


    On Sat, Jun 21, 2014 at 7:12 PM, Andy wrote:

        Yes, you can use CountVectorizer.
        Do you want the different features to share the same vocabulary?
        To use CountVectorizer, you probably have to either collect
        all the values (for a shared vocabulary)
        or learn one CountVectorizer per key (you could use
        FeatureUnion for that).

        So there is a little bit of code to write to handle the fact
        that you have multiple text fields.
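
A minimal sketch of the two options just described, assuming the data is a list of dicts with two text keys. The records and the key names "b" and "d" are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical records with two text fields.
records = [
    {"b": "red apple", "d": "apple pie"},
    {"b": "green pear", "d": "pear tart"},
]
keys = ["b", "d"]

# Option 1: shared vocabulary. Fit one vectorizer on the values of all keys,
# then transform each key's values with that same vectorizer.
shared = CountVectorizer()
shared.fit([rec[k] for rec in records for k in keys])
shared_counts = {k: shared.transform([rec[k] for rec in records]) for k in keys}

# Option 2: one vectorizer per key, each learning its own vocabulary.
per_key = {k: CountVectorizer().fit([rec[k] for rec in records]) for k in keys}

print(sorted(shared.vocabulary_))
print({k: sorted(v.vocabulary_) for k, v in per_key.items()})
```

With the shared vocabulary, every key is encoded in the same column space, so "apple" means the same column whether it came from "b" or "d". With one vectorizer per key, the columns are kept separate per field.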

        Hth,
        Andy



        On 06/21/2014 03:35 PM, Abijith Kp wrote:

        Hi,


        Initially, one of my feature lists looks like:  {"a":"3",
        "b":"random1", "c":"", "d":"random2 text"}.
        The random text contains names of people, email ids, some
        description, numbers, and so on.

        When I used DictVectorizer, I could not get an accurate
        clustering.

        I wanted to know if there is any method similar to
        DictVectorizer which could process a dictionary of string
        features correctly.

        Regards,
        Abijith


        On Sat, Jun 21, 2014 at 6:51 PM, Andy wrote:

            Hi Abijith.

            It depends on how you want to interpret the strings.
            If they are texts and you want to interpret them based on
            their content, Brian's suggestion is the right one.
            If you want to consider each possible string as a
            distinct feature, the OneHotEncoder would be the right
            choice.
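
The difference between the two interpretations can be sketched on a tiny made-up example. This sketch uses DictVectorizer rather than OneHotEncoder for the "each string is a distinct feature" case, since DictVectorizer applies one-of-K coding to string values in dicts; the field name "d" and its values are assumptions.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical values for a single string field "d".
records = [{"d": "random2 text"}, {"d": "random3 text"}]

# Each distinct string as its own feature: every whole string gets a column,
# so these two records have no column in common.
one_hot = DictVectorizer()
one_hot.fit(records)
print(sorted(one_hot.vocabulary_))

# Interpreting the content: the strings are tokenized into words, so the
# shared word "text" becomes a column both records use.
by_content = CountVectorizer()
by_content.fit([rec["d"] for rec in records])
print(sorted(by_content.vocabulary_))
```

For free text that rarely repeats exactly, the first encoding makes nearly every sample unique, which may be one reason a clustering on it looks poor.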

            Could you give an example of what the strings and the
            semantics of the strings are?

            Andy




            On 06/20/2014 06:05 PM, Abijith Kp wrote:

            Can anyone help me with the problem of dealing with
            features which are both strings of varying length (say
            from 0 to 100-150 characters) and numbers?

            What are the most widely used techniques in this
            kind of situation? And can it be solved using only
            scikit-learn?

            PS: Initially I have to convert a json file to a
            feature list, and then use it.

            Any help is appreciated.

            Regards,
            Abijith

--Abijith KP

            github.com/abijith-kp <http://github.com/abijith-kp>
            kpabijith.wordpress.com <http://kpabijith.wordpress.com>


            
------------------------------------------------------------------------------
            HPCC Systems Open Source Big Data Platform from LexisNexis Risk Solutions
            Find What Matters Most in Your Big Data with HPCC Systems
            Open Source. Fast. Scalable. Simple. Ideal for Dirty Data.
            Leverages Graph Analysis for Fast Processing & Easy Data Exploration
            http://p.sf.net/sfu/hpccsystems


            _______________________________________________
            Scikit-learn-general mailing list
            [email protected]  
<mailto:[email protected]>
            https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



            