Re: [Scikit-learn-general] Pipeline: string categorical data preprocessing

Andreas Mueller Mon, 28 Mar 2016 12:34:09 -0700

Hi.
In general, please stay on the mailing list.

We could make the check_array call in FunctionTransformer optional via a parameter.
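
For readers of the archive, a minimal sketch of what such a switch looks like in use. It assumes a scikit-learn release in which FunctionTransformer accepts a `validate` argument; the DataFrame and its `color` column are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data with a string-valued categorical column.
X = pd.DataFrame({"color": ["red", "green", "red"]})

# validate=False asks FunctionTransformer to skip input validation,
# so the string column reaches pd.get_dummies untouched.
dummies = FunctionTransformer(pd.get_dummies, validate=False)
encoded = dummies.fit_transform(X)

print(encoded)
```

Whether validation is on or off by default depends on the installed version, so passing the argument explicitly is the safer choice.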


Cheers,
Andy

On 03/28/2016 01:34 PM, Алексей Драль wrote:

Hi Andreas,

Nice, I didn't know about make_pipeline before, thank you. I have exactly the situation that you pointed out: the categories are strings, and some of them frequently show up only in the test split. I'll keep this approach in mind for next time.
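
The mismatch described above can be reproduced with pandas alone. The sketch below is only an illustration with made-up data: it shows that get_dummies produces different columns for the two splits, and one possible workaround, which is to reindex the test columns against the training columns. That workaround is an assumption of this example, not something proposed in the thread.

```python
import pandas as pd

# Hypothetical splits: "blue" appears only in the test data.
train = pd.DataFrame({"color": ["red", "green", "red"]})
test = pd.DataFrame({"color": ["red", "blue"]})

train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

# The two frames do not share the same columns.
train_columns = list(train_dummies.columns)
test_columns = list(test_dummies.columns)

# Align the test frame to the training columns: categories unseen in
# training are dropped, categories missing from the test split become 0.
aligned_test = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_columns, test_columns)
print(aligned_test)
```

Note that the row holding the unseen category ends up as all zeros after alignment, which may or may not be acceptable for a given model.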

P.S. Testing revealed that FunctionTransformer uses check_array, which can lead to problems when the columns have dtype object and contain strings.

P.P.S. At first, I was wondering whether it would be valuable to make a pull request, but CategoricalEncoder should fix the problem.
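
A note for later readers: as far as I know, string handling of this kind ended up in OneHotEncoder itself in later scikit-learn releases. The sketch below assumes such a release (0.20 or newer, to my knowledge), where OneHotEncoder accepts string columns and ColumnTransformer is available. The data, the column names and the choice of LogisticRegression are placeholders.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one string categorical column, one numeric column.
train = pd.DataFrame({"color": ["red", "green", "red", "green"],
                      "size": [1.0, 2.0, 3.0, 4.0]})
y = [0, 1, 0, 1]
test = pd.DataFrame({"color": ["blue", "red"], "size": [2.5, 1.5]})

# One-hot encode only the listed column and pass the rest through.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])],
    remainder="passthrough",
)
model = make_pipeline(encoder, LogisticRegression())
model.fit(train, y)

predictions = model.predict(test)
encoded_test = model.named_steps["columntransformer"].transform(test)
print(encoded_test)
```

Because the encoder is fitted inside the pipeline, the same category-to-column mapping is reused at prediction time, which avoids the column mismatch that get_dummies can produce.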

2016-03-28 18:58 GMT+03:00 Andreas Mueller:


    Untested code:

    make_pipeline(FunctionTransformer(lambda X: pd.get_dummies(X)),
    SomeClassifier())

    giant caveat: that will only work if the categories are exactly
    the same in all possible X that you pass.
    Otherwise weird stuff will happen.


    On 03/26/2016 07:21 AM, Алексей Драль wrote:

    Hi Andreas,

    Sadly enough, get_dummies is not applicable in pipelines. Thank
    you for the link to the fix.

    2016-03-25 18:57 GMT+03:00 Andreas Mueller:

        This is very common but currently not that easy.
        There is a fix here:
        https://github.com/scikit-learn/scikit-learn/pull/6559

        In the meantime, I think the easiest way is to use pandas'
        get_dummies function.


        On 03/19/2016 02:17 PM, Алексей Драль wrote:

        Hi there,

        I have a data set which contains string categorical variables
        (like "category_A", "category_B"). I would like to generate
        dummy variables from them, but I can't use OneHotEncoder, as it
        expects a matrix of integers. I cannot use LabelEncoder either,
        because I cannot specify which columns to process. I wrote a
        simple class that applies DictVectorizer per column and stores
        the fitted processors. This use case looks so common that I
        would expect sklearn to contain some functionality for it.
        Could you please tell me whether I am missing a standard
        preprocessor that generates dummy variables from strings for
        specified columns?

        --
        Yours sincerely,
        Alexey A. Dral


        


        _______________________________________________
        Scikit-learn-general mailing list
        https://lists.sourceforge.net/lists/listinfo/scikit-learn-general



