Github user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r174624206
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -70,19 +70,22 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
def getMinDF: Double = $(minDF)
/**
- * Specifies the maximum number of different documents a term must appear in to be included
- * in the vocabulary.
- * If this is an integer greater than or equal to 1, this specifies the number of documents
- * the term must appear in; if this is a double in [0,1), then this specifies the fraction of
- * documents.
+ * maxDF is used for removing terms that appear too frequently. It specifies the maximum number
+ * of different documents a term could appear in to be included in the vocabulary.
+ * If this is an integer greater than or equal to 1, this specifies the maximum number of
+ * documents the term could appear in; if this is a double in [0,1), then this specifies the
+ * maximum fraction of documents the term could appear in. A term appears more frequently
+ * than maxDF will be removed.
--- End diff --
This sounds much better, but it should probably say "ignored" instead of
"removed", and it might be good to reorder the sentences like this:
```
Specifies the maximum number of different documents a term could appear in to be included
in the vocabulary. A term that appears more than the threshold will be ignored. If this is an
integer greater than or equal to 1, this specifies the maximum number of documents the term
could appear in; if this is a double in [0,1), then this specifies the maximum fraction of
documents the term could appear in.
```
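For reference, the threshold semantics described in that wording could be sketched in plain Scala roughly as follows. This is only an illustration and not the code in CountVectorizer.scala: `MaxDFSketch`, `resolveMaxDF` and `filterVocab` are hypothetical names, and treating a term whose document frequency equals the threshold as kept is an assumption read from "appears more than the threshold will be ignored".

```scala
// Hypothetical helper, not Spark's implementation.
object MaxDFSketch {

  // A value >= 1.0 is read as an absolute document count;
  // a value in [0, 1) is read as a fraction of all documents.
  def resolveMaxDF(maxDF: Double, numDocs: Long): Double =
    if (maxDF >= 1.0) maxDF else maxDF * numDocs

  // Keep only terms whose document frequency does not exceed the threshold
  // (assumption: a term exactly at the threshold is kept).
  def filterVocab(
      docFreq: Map[String, Long],
      maxDF: Double,
      numDocs: Long): Set[String] = {
    val limit = resolveMaxDF(maxDF, numDocs)
    docFreq.collect { case (term, df) if df <= limit => term }.toSet
  }
}
```

With four documents and document frequencies `the -> 4`, `ml -> 3`, `spark -> 2`, a maxDF of `3.0` would ignore only `the`, while a maxDF of `0.5` resolves to a limit of two documents and would ignore both `the` and `ml`.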