GitHub user crackcell opened a pull request:

    https://github.com/apache/spark/pull/17233

    [SPARK-11569][ML] Fix StringIndexer to handle null value properly

    ## What changes were proposed in this pull request?
    
    This PR enhances StringIndexer to handle NULL values.
    
    Before this PR, StringIndexer throws an exception when it encounters NULL values.
    With this PR:
    - handleInvalid=error: Throw an exception as before
    - handleInvalid=skip: Skip null values as well as unseen labels
    - handleInvalid=keep: Give null values an additional index, the same way as unseen labels
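
To illustrate the three modes listed above, here is a minimal plain-Python sketch. It is not Spark code and not the implementation in this PR: `fit_labels` and `transform` are hypothetical helper names, nulls are modelled as `None`, and the sketch assumes that under `keep` null values and unseen labels share a single extra index equal to the number of fitted labels. It also assumes nulls are ignored when the labels are fitted and that ties in frequency are broken alphabetically.

```python
from collections import Counter


def fit_labels(values):
    """Return labels ordered by descending frequency, ignoring nulls (None)."""
    counts = Counter(v for v in values if v is not None)
    # Alphabetical tie-break is an assumption of this sketch, for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered]


def transform(values, labels, handle_invalid="error"):
    """Map each value to (value, index) according to handle_invalid."""
    index = {label: float(i) for i, label in enumerate(labels)}
    extra_index = float(len(labels))  # assumed shared by nulls and unseen labels
    result = []
    for v in values:
        if v in index:
            result.append((v, index[v]))
        elif handle_invalid == "skip":
            continue  # drop nulls and unseen labels
        elif handle_invalid == "keep":
            result.append((v, extra_index))
        elif handle_invalid == "error":
            raise ValueError(f"null or unseen label: {v!r}")
        else:
            raise ValueError(f"unknown handle_invalid: {handle_invalid!r}")
    return result


labels = fit_labels(["a", "b", "a", None, "c", "a", "b"])
skipped = transform(["a", None, "d", "b"], labels, handle_invalid="skip")
kept = transform(["a", None, "d", "b"], labels, handle_invalid="keep")
```

With `handle_invalid="error"`, the same input raises on the first null or unseen value, which matches the "as before" behaviour described above.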
    
    BTW, I noticed that someone tried to solve the same problem (#9920), but it seems to have had no progress or response for a long time. Would you mind giving me a chance to solve it?
    
    ## How was this patch tested?
    
    New unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/crackcell/spark 11569_StringIndexer_NULL

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17233.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17233
    
----
commit 75e3975597aa6271f4f8ab688922edda88b03045
Author: Menglong TAN <[email protected]>
Date:   2017-03-08T03:50:17Z

    Merge pull request #1 from apache/master
    
    merge master to my repo

commit 79d706085e8371fb1724ce73377767c38d551e5d
Author: Menglong TAN <[email protected]>
Date:   2017-03-10T04:45:56Z

    Enhance StringIndexer with NULL values

commit 0cb121c65f592b9623bdeef2746d7c2a3c281ae1
Author: Menglong TAN <[email protected]>
Date:   2017-03-10T04:52:30Z

    filter out NULLs when transform dataset

----

