Repository: spark
Updated Branches:
  refs/heads/branch-1.6 bfcc8cfee -> 75531c77e
[SPARK-12217][ML] Document invalid handling for StringIndexer

Added a paragraph regarding StringIndexer#setHandleInvalid to the
ml-features documentation.

I wonder if I should also add a snippet to the code example, input welcome.

Author: BenFradet <benjamin.fra...@gmail.com>

Closes #10257 from BenFradet/SPARK-12217.

(cherry picked from commit aea676ca2d07c72b1a752e9308c961118e5bfc3c)
Signed-off-by: Joseph K. Bradley <jos...@databricks.com>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/75531c77
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/75531c77
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/75531c77

Branch: refs/heads/branch-1.6
Commit: 75531c77e85073c7be18985a54c623710894d861
Parents: bfcc8cf
Author: BenFradet <benjamin.fra...@gmail.com>
Authored: Fri Dec 11 15:43:00 2015 -0800
Committer: Joseph K. Bradley <jos...@databricks.com>
Committed: Fri Dec 11 15:43:09 2015 -0800

----------------------------------------------------------------------
 docs/ml-features.md | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/75531c77/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 6494fed..8b00cc6 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -459,6 +459,42 @@ column, we should get the following:
 "a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with index `2`.
+Additionally, there are two strategies regarding how `StringIndexer` will handle
+unseen labels when you have fit a `StringIndexer` on one dataset and then use it
+to transform another:
+
+- throw an exception (which is the default)
+- skip the row containing the unseen label entirely
+
+**Examples**
+
+Let's go back to our previous example but this time reuse our previously defined
+`StringIndexer` on the following dataset:
+
+~~~~
+ id | category
+----|----------
+ 0  | a
+ 1  | b
+ 2  | c
+ 3  | d
+~~~~
+
+If you've not set how `StringIndexer` handles unseen labels or set it to
+"error", an exception will be thrown.
+However, if you had called `setHandleInvalid("skip")`, the following dataset
+will be generated:
+
+~~~~
+ id | category | categoryIndex
+----|----------|---------------
+ 0  | a        | 0.0
+ 1  | b        | 2.0
+ 2  | c        | 1.0
+~~~~
+
+Notice that the row containing "d" does not appear.
+
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
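
On the snippet question raised in the commit message: the sketch below is one possible illustration of the two strategies, written in plain Python with no Spark dependency. It is a simplified model of the behavior the added paragraph describes, not Spark's implementation. The helper names `fit_labels` and `transform`, the `handle_invalid` argument, and the fitting data (chosen so that "a" is most frequent, then "c", then "b") are all assumptions made for illustration.

```python
from collections import Counter


def fit_labels(values):
    """Hypothetical helper: index labels by descending frequency (most frequent gets 0.0)."""
    ordered = [label for label, _ in Counter(values).most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(rows, index, handle_invalid="error"):
    """Hypothetical helper: append the label index to each (id, category) row.

    handle_invalid="error" raises on an unseen label (modelled as the default);
    handle_invalid="skip" drops the row that contains the unseen label.
    """
    out = []
    for row_id, category in rows:
        if category not in index:
            if handle_invalid == "skip":
                continue  # drop the whole row
            raise ValueError(f"Unseen label: {category}")
        out.append((row_id, category, index[category]))
    return out


# Assumed fitting data: "a" appears three times, "c" twice, "b" once.
index = fit_labels(["a", "b", "c", "a", "a", "c"])

# The second dataset from the diff, which contains the unseen label "d".
new_rows = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
skipped = transform(new_rows, index, handle_invalid="skip")
```

With `handle_invalid="skip"` the row for "d" is dropped and the remaining rows keep the indices learned at fit time; with the default, `transform(new_rows, index)` raises instead.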