GitHub user yogeshg opened a pull request:
https://github.com/apache/spark/pull/20829
[SPARK-23690] [ML] Add handleinvalid to VectorAssembler
## What changes were proposed in this pull request?
Introduce a `handleInvalid` parameter in `VectorAssembler` that accepts the
options `"keep"`, `"skip"`, and `"error"`. `"error"` throws an error on seeing a
row containing a `null`, `"skip"` filters out all such rows, and `"keep"` fills
in the appropriate number of NaNs. To determine how many NaNs to add, `"keep"`
looks for an example value, and it throws an error when no such number can be found.
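
As an illustration only, the three modes described above can be modelled in plain Python. This is a sketch of the described semantics, not the Spark implementation: the names `column_widths` and `assemble`, the `vector_cols` argument, and the use of `ValueError` are all hypothetical, and rows are ordinary tuples rather than `DataFrame` rows.

```python
import math


def column_widths(rows, vector_cols):
    """Return the output width of each column.

    Scalar columns have width 1. A vector column takes its width from
    the first non-null example; if none exists, the width is unknown.
    """
    n_cols = len(rows[0]) if rows else 0
    widths = []
    for j in range(n_cols):
        if j not in vector_cols:
            widths.append(1)
            continue
        example = next((r[j] for r in rows if r[j] is not None), None)
        if example is None:
            raise ValueError(f"cannot infer the size of vector column {j}")
        widths.append(len(example))
    return widths


def assemble(rows, vector_cols=frozenset(), handle_invalid="error"):
    """Flatten each row into one list of floats (hypothetical model).

    handle_invalid: "error" raises on a null, "skip" drops the row,
    "keep" replaces each null with the right number of NaNs.
    """
    if handle_invalid not in ("keep", "skip", "error"):
        raise ValueError(f"unknown handle_invalid: {handle_invalid}")
    # Widths are only needed when nulls are replaced by NaNs.
    widths = column_widths(rows, vector_cols) if handle_invalid == "keep" else None
    assembled = []
    for row in rows:
        if any(cell is None for cell in row):
            if handle_invalid == "error":
                raise ValueError("row contains a null")
            if handle_invalid == "skip":
                continue
        flat = []
        for j, cell in enumerate(row):
            if cell is None:
                flat.extend([math.nan] * widths[j])
            elif j in vector_cols:
                flat.extend(float(x) for x in cell)
            else:
                flat.append(float(cell))
        assembled.append(flat)
    return assembled


rows = [(1.0, [2.0, 3.0]), (None, [4.0, 5.0]), (6.0, None)]
kept = assemble(rows, vector_cols={1}, handle_invalid="keep")
skipped = assemble(rows, vector_cols={1}, handle_invalid="skip")
```

Here `skipped` contains only the first row, while `kept` has three rows of length 3, with one NaN standing in for the null scalar and two NaNs for the null vector. If every value in a vector column is null, the `"keep"` mode in this sketch raises, mirroring the case where no size can be inferred.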
## How was this patch tested?
Unit tests check the behavior of `assemble` on specific rows, and the
transformer is run on `DataFrame`s with different configurations to
cover different corner cases.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yogeshg/spark rformula_handleinvalid
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20829.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20829
----
commit 17883798ee406670e52497a32b6a6f55f3e8fbc4
Author: Bago Amirbekian <bago@...>
Date: 2018-03-14T22:42:27Z
Better error for streaming dataframes, ensure non-null Vectors in first.
commit c34332d656849ff8d2b9fbfe752076d8a37cc430
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date: 2018-03-09T01:29:18Z
add NaN for null column
commit f2f763dc54401f6b8009cd99e42b1eb2891a1f8c
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date: 2018-03-14T22:13:04Z
get lengths with a map
commit 272a806cee85f1028b29da418c5ac8ca27a99cb6
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date: 2018-03-14T22:43:12Z
wip
commit dc99db851b46b7112e04c86c9f3fccc197aa97d2
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date: 2018-03-14T22:45:11Z
wip
commit 61fbcc42891e834b9914def6dadbe1ff24725998
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date: 2018-03-14T22:45:24Z
wip
commit 08b8c048afe2a806bbf8f2ce5017a16ba08997e8
Author: Bago Amirbekian <bago@...>
Date: 2018-03-14T22:50:05Z
Merge branch 'rformula_handleinvalid' into vectorAssemblerStuff
commit 8c98d368739539d1113096c4ae8b81bda27eb950
Author: Bago Amirbekian <bago@...>
Date: 2018-03-14T23:55:11Z
Merge fixes.
commit cb0faba9dbc860cf5550bb59904a02c17ced8a00
Author: Yogesh Garg <1059168+yogeshg@...>
Date: 2018-03-15T00:01:50Z
Merge pull request #2 from MrBago/vectorAssemblerStuff
Vector assembler stuff
commit 3c3532c624b796f723720747d4e5453a1316b329
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date: 2018-03-15T00:45:20Z
fix issues with this implementation
commit d29228ceed9479841cf2b8e4994388c6b628f0c1
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date: 2018-03-15T02:02:53Z
fix bugs; add tests
commit c0c0e3df92dc787d9ce9ca632353d86c6457420d
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Date: 2018-03-15T02:12:59Z
clean
----