Hi.
In general, please stay on the mailing list.
We could make the check_array in FunctionTransformer optional via a parameter.

Cheers,
Andy

On 03/28/2016 01:34 PM, Алексей Драль wrote:
Hi Andreas,

Nice, I didn't know about make_pipeline before, thank you. My situation is exactly the one you pointed out: the categories are strings, and some of them frequently show up only in the test split. I'll keep this approach in mind for next time.
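
For illustration, here is a minimal sketch of that caveat using pandas only. The column name "color" and the toy values are made up for the example, and aligning the test columns with reindex is just one possible workaround, not something proposed in this thread.

```python
import pandas as pd

# Toy data: "green" appears only in the test split, "red" only in train.
train = pd.DataFrame({"color": ["red", "blue", "red"]})
test = pd.DataFrame({"color": ["blue", "green"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# The two splits produce different dummy columns.
train_columns = list(train_dummies.columns)
test_columns = list(test_dummies.columns)

# Possible workaround: force the test frame onto the training columns.
# Categories unseen in training are dropped, missing ones are filled with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```

A classifier fitted on train_dummies would receive differently named (and possibly differently ordered) features from test_dummies, which is the "weird stuff" the caveat warns about; test_aligned has the same columns as the training frame.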

P.S. Testing revealed that FunctionTransformer calls check_array, which can lead to problems when the columns have object dtype and contain strings.

P.P.S. At first I wondered whether it would be worth making a pull request, but CategoricalEncoder should fix the problem.
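
A small sketch of the check_array issue, assuming a scikit-learn release in which FunctionTransformer accepts a validate flag (the version discussed in this thread may behave differently). The toy DataFrame is invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

X = pd.DataFrame({"color": ["red", "blue", "red"]})

# validate=False skips the input check, so the string column
# reaches pd.get_dummies unchanged.
encoder = FunctionTransformer(pd.get_dummies, validate=False)
dummies = encoder.fit_transform(X)
dummy_columns = list(dummies.columns)

# validate=True runs the input check first; it is expected to reject
# string data because it tries to convert the input to a numeric array.
strict = FunctionTransformer(pd.get_dummies, validate=True)
try:
    strict.fit_transform(X)
    validation_failed = False
except (ValueError, TypeError):
    validation_failed = True
```

With the check switched off, the transformer is only a thin wrapper around get_dummies, so the caveat about differing categories between splits still applies.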


2016-03-28 18:58 GMT+03:00 Andreas Mueller <t3k...@gmail.com>:

    Untested code:

    make_pipeline(FunctionTransformer(lambda X: pd.get_dummies(X)),
    SomeClassifier())

    giant caveat: that will only work if the categories are exactly
    the same in all possible X that you pass.
    Otherwise weird stuff will happen.


    On 03/26/2016 07:21 AM, Алексей Драль wrote:
    Hi Andreas,

    Sadly, get_dummies cannot be used in pipelines. Thank you for
    the link to the fix.

    2016-03-25 18:57 GMT+03:00 Andreas Mueller <t3k...@gmail.com>:

        This is very common but currently not that easy.
        There is a fix here:
        https://github.com/scikit-learn/scikit-learn/pull/6559

        In the meantime, I think the easiest way is to use pandas'
        get_dummies function.


        On 03/19/2016 02:17 PM, Алексей Драль wrote:
        Hi there,

        I have a data set which contains string categorical variables (like
        "category_A", "category_B"). I would like to generate dummy variables
        from them, but I can't use OneHotEncoder, as it expects a matrix of
        integers. I cannot use LabelEncoder either, because I cannot specify
        which columns to process. I wrote a simple class that applies
        DictVectorizer per column and stores the fitted processors. This use
        case looks so common that I would expect sklearn to contain some
        functionality for it. Could you please tell me whether I am missing a
        standard preprocessor that generates dummy variables from strings for
        specified columns?

        --
        Yours sincerely,
        Alexey A. Dral


        


        _______________________________________________
        Scikit-learn-general mailing list
        Scikit-learn-general@lists.sourceforge.net
        <mailto:Scikit-learn-general@lists.sourceforge.net>
        https://lists.sourceforge.net/lists/listinfo/scikit-learn-general




    --
    Yours sincerely,
    Alexey A. Dral




--
Yours sincerely,
Alexey A. Dral
