GitHub user yogeshg opened a pull request:

    https://github.com/apache/spark/pull/20829

    [SPARK-23690] [ML] Add handleinvalid to VectorAssembler

    ## What changes were proposed in this pull request?
    
    Introduce a `handleInvalid` parameter in `VectorAssembler` that accepts the 
options `"keep"`, `"skip"`, and `"error"`. `"error"` throws an error on seeing a 
row containing a `null`, `"skip"` filters out all such rows, and `"keep"` fills 
in the relevant number of NaNs. To determine how many NaNs to add, `"keep"` 
looks for an example, and throws an error when no such number can be found.
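
The three options described above can be illustrated with a small pure-Python sketch. It is not the code in this patch and does not use Spark: the names `assemble` and `infer_lengths`, the representation of rows as tuples, and the choice to take each column's width from its first non-null value are all assumptions made for illustration.

```python
import math

VALID_OPTIONS = ("keep", "skip", "error")


def infer_lengths(rows):
    """Hypothetical helper: take each column's width from its first non-null value."""
    if not rows:
        return []
    lengths = [None] * len(rows[0])
    for row in rows:
        for i, value in enumerate(row):
            if lengths[i] is None and value is not None:
                # A list/tuple stands in for a vector column; anything else is a scalar.
                lengths[i] = len(value) if isinstance(value, (list, tuple)) else 1
    if any(n is None for n in lengths):
        # Mirrors the described behavior: "keep" fails when no width can be found.
        raise ValueError("cannot infer how many NaNs to add for an all-null column")
    return lengths


def assemble(rows, handle_invalid="error"):
    """Hypothetical sketch: flatten each row into one list of floats."""
    if handle_invalid not in VALID_OPTIONS:
        raise ValueError(f"handle_invalid must be one of {VALID_OPTIONS}")
    # Widths are only needed when nulls are replaced by NaNs.
    lengths = infer_lengths(rows) if handle_invalid == "keep" else None
    assembled = []
    for row in rows:
        if any(value is None for value in row):
            if handle_invalid == "error":
                raise ValueError("row contains a null value")
            if handle_invalid == "skip":
                continue  # drop the whole row
        vector = []
        for i, value in enumerate(row):
            if value is None:
                vector.extend([math.nan] * lengths[i])  # only reached under "keep"
            elif isinstance(value, (list, tuple)):
                vector.extend(float(x) for x in value)
            else:
                vector.append(float(value))
        assembled.append(vector)
    return assembled


rows = [(1.0, [2.0, 3.0]), (None, [4.0, 5.0]), (6.0, None)]
print(assemble(rows, "skip"))
print(assemble(rows, "keep"))
```

With these sample rows, `"skip"` retains only the first row, `"keep"` retains all three and pads the missing scalar with one NaN and the missing vector with two, and `"error"` raises on the second row.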
    
    ## How was this patch tested?
    
    Unit tests were added to check the behavior of `assemble` on specific rows, 
and the transformer is run on `DataFrame`s with different configurations to 
cover different corner cases.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yogeshg/spark rformula_handleinvalid

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20829.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20829
    
----
commit 17883798ee406670e52497a32b6a6f55f3e8fbc4
Author: Bago Amirbekian <bago@...>
Date:   2018-03-14T22:42:27Z

    Better error for streaming dataframes, ensure non-null Vectors in first.

commit c34332d656849ff8d2b9fbfe752076d8a37cc430
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date:   2018-03-09T01:29:18Z

    add NaN for null column

commit f2f763dc54401f6b8009cd99e42b1eb2891a1f8c
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date:   2018-03-14T22:13:04Z

    get lengths with a map

commit 272a806cee85f1028b29da418c5ac8ca27a99cb6
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date:   2018-03-14T22:43:12Z

    wip

commit dc99db851b46b7112e04c86c9f3fccc197aa97d2
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date:   2018-03-14T22:45:11Z

    wip

commit 61fbcc42891e834b9914def6dadbe1ff24725998
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date:   2018-03-14T22:45:24Z

    wip

commit 08b8c048afe2a806bbf8f2ce5017a16ba08997e8
Author: Bago Amirbekian <bago@...>
Date:   2018-03-14T22:50:05Z

    Merge branch 'rformula_handleinvalid' into vectorAssemblerStuff

commit 8c98d368739539d1113096c4ae8b81bda27eb950
Author: Bago Amirbekian <bago@...>
Date:   2018-03-14T23:55:11Z

    Merge fixes.

commit cb0faba9dbc860cf5550bb59904a02c17ced8a00
Author: Yogesh Garg <1059168+yogeshg@...>
Date:   2018-03-15T00:01:50Z

    Merge pull request #2 from MrBago/vectorAssemblerStuff
    
    Vector assembler stuff

commit 3c3532c624b796f723720747d4e5453a1316b329
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date:   2018-03-15T00:45:20Z

    fix issues with this implementation

commit d29228ceed9479841cf2b8e4994388c6b628f0c1
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date:   2018-03-15T02:02:53Z

    fix bugs; add tests

commit c0c0e3df92dc787d9ce9ca632353d86c6457420d
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date:   2018-03-15T02:12:59Z

    clean

----

