Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/20367#discussion_r163911373
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -160,6 +187,11 @@ class CountVectorizer @Since("1.5.0") (@Since("1.5.0")
override val uid: String)
} else {
$(minDF) * input.cache().count()
}
+ val maxDf = if ($(maxDF) >= 1.0) {
+ $(maxDF)
+ } else {
+ $(maxDF) * input.cache().count()
+ }
--- End diff --
I wonder if it's best to check that the max is >= the min? this may be the
only place that can be done as one might be a fraction and the other not.
Also this ends up counting the input twice when both params are set, which
might be worth avoiding. It could be computed and saved only when needed.
(I suppose this also fails to unpersist `input`, which could happen right
after `wordCounts.count()`. That's a slightly different issue, so not too
concerned about whether to fix it here or not.)
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]