spark git commit: [SPARK-12217][ML] Document invalid handling for StringIndexer

jkbradley Fri, 11 Dec 2015 15:43:53 -0800

Repository: spark
Updated Branches:
  refs/heads/branch-1.6 bfcc8cfee -> 75531c77e



[SPARK-12217][ML] Document invalid handling for StringIndexer

Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features 
documentation.

I wonder if I should also add a snippet to the code example, input welcome.

Author: BenFradet <benjamin.fra...@gmail.com>

Closes #10257 from BenFradet/SPARK-12217.

(cherry picked from commit aea676ca2d07c72b1a752e9308c961118e5bfc3c)
Signed-off-by: Joseph K. Bradley <jos...@databricks.com>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/75531c77
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/75531c77
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/75531c77

Branch: refs/heads/branch-1.6
Commit: 75531c77e85073c7be18985a54c623710894d861
Parents: bfcc8cf
Author: BenFradet <benjamin.fra...@gmail.com>
Authored: Fri Dec 11 15:43:00 2015 -0800
Committer: Joseph K. Bradley <jos...@databricks.com>
Committed: Fri Dec 11 15:43:09 2015 -0800

----------------------------------------------------------------------
 docs/ml-features.md | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/75531c77/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 6494fed..8b00cc6 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -459,6 +459,42 @@ column, we should get the following:
 "a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
 index `2`.
 
+Additionally, there are two strategies regarding how `StringIndexer` will handle
+unseen labels when you have fit a `StringIndexer` on one dataset and then use it
+to transform another:
+
+- throw an exception (which is the default)
+- skip the row containing the unseen label entirely
+
+**Examples**
+
+Let's go back to our previous example but this time reuse our previously defined
+`StringIndexer` on the following dataset:
+
+~~~~
+ id | category
+----|----------
+ 0  | a
+ 1  | b
+ 2  | c
+ 3  | d
+~~~~
+
+If you've not set how `StringIndexer` handles unseen labels or set it to
+"error", an exception will be thrown.
+However, if you had called `setHandleInvalid("skip")`, the following dataset
+will be generated:
+
+~~~~
+ id | category | categoryIndex
+----|----------|---------------
+ 0  | a        | 0.0
+ 1  | b        | 2.0
+ 2  | c        | 1.0
+~~~~
+
+Notice that the row containing "d" does not appear.
+
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">

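The two strategies documented in the diff above ("error" and "skip") can be modeled in a few lines of plain Python. This is only an illustrative sketch, not Spark code: `fit_labels` and `transform` are hypothetical helpers, the training column `a, b, c, a, a, c` is an assumption chosen to reproduce the indices quoted in the diff ("a" = 0, "c" = 1, "b" = 2), and the alphabetical tie-break is likewise an assumption.

```python
from collections import Counter


def fit_labels(values):
    """Map each label to an index, most frequent label first (index 0.0)."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (an assumption
    # made here only so the result is deterministic).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(rows, label_to_index, handle_invalid="error"):
    """Append the label index to each (id, category) row.

    handle_invalid="error" raises on an unseen label (the default);
    handle_invalid="skip" drops the row containing the unseen label.
    """
    if handle_invalid not in ("error", "skip"):
        raise ValueError(f"unsupported handle_invalid: {handle_invalid!r}")
    result = []
    for row_id, category in rows:
        if category not in label_to_index:
            if handle_invalid == "error":
                raise ValueError(f"Unseen label: {category}")
            continue  # "skip": leave this row out of the output
        result.append((row_id, category, label_to_index[category]))
    return result


# Assumed training column, consistent with the indices described in the diff.
training = ["a", "b", "c", "a", "a", "c"]
label_to_index = fit_labels(training)

# The second dataset from the diff, which contains the unseen label "d".
new_rows = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
skipped = transform(new_rows, label_to_index, handle_invalid="skip")
print(skipped)  # [(0, 'a', 0.0), (1, 'b', 2.0), (2, 'c', 1.0)]
```

Calling `transform(new_rows, label_to_index)` without the `handle_invalid="skip"` argument raises a `ValueError` in this sketch, mirroring the default "error" behavior described above.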
