Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r174625203
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -70,19 +70,22 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
def getMinDF: Double = $(minDF)
/**
- * Specifies the maximum number of different documents a term must appear in to be included
- * in the vocabulary.
- * If this is an integer greater than or equal to 1, this specifies the number of documents
- * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
- * documents.
+ * maxDF is used for removing terms that appear too frequently. It specifies the maximum number
+ * of different documents a term could appear in to be included in the vocabulary.
+ * If this is an integer greater than or equal to 1, this specifies the maximum number of
+ * documents the term could appear in; if this is a double in [0,1), then this specifies the
+ * maximum fraction of documents the term could appear in. A term appears more frequently
+ * than maxDF will be removed.
*
- * Default: (2^64^) - 1
+ * Default: (2^63) - 1
--- End diff --
good catch!
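
For context, the maxDF rule described in the new doc comment can be sketched as below. This is a minimal illustration under stated assumptions, not Spark's implementation: the object `MaxDFSketch` and the helpers `maxDocCount` and `keepTerm` are hypothetical names introduced here. The last line only checks the arithmetic behind the corrected default, namely that (2^63) - 1 is the largest value a signed 64-bit `Long` can hold.

```scala
// Hypothetical sketch, not Spark's code: resolves a maxDF setting into a
// document-count threshold and applies it to a term's document frequency.
object MaxDFSketch {
  // Values >= 1.0 are read as an absolute document count; values in [0, 1)
  // are read as a fraction of the corpus size (assumed interpretation,
  // following the doc comment in the diff).
  def maxDocCount(maxDF: Double, numDocs: Long): Double =
    if (maxDF >= 1.0) maxDF else maxDF * numDocs

  // A term is kept only if it appears in no more documents than the threshold.
  def keepTerm(docFreq: Long, maxDF: Double, numDocs: Long): Boolean =
    docFreq <= maxDocCount(maxDF, numDocs)

  def main(args: Array[String]): Unit = {
    println(keepTerm(docFreq = 90, maxDF = 0.5, numDocs = 100)) // false: 90 > 50
    println(keepTerm(docFreq = 40, maxDF = 0.5, numDocs = 100)) // true: 40 <= 50
    println(keepTerm(docFreq = 3, maxDF = 2.0, numDocs = 100))  // false: 3 > 2
    // (2^63) - 1 is exactly Long.MaxValue; (2^64) - 1 does not fit in a Long.
    println(BigInt(2).pow(63) - 1 == BigInt(Long.MaxValue))     // true
  }
}
```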