GitHub user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/20367#discussion_r163640976
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala ---
@@ -113,7 +132,11 @@ private[feature] trait CountVectorizerParams extends Params with HasInputCol wit
/** @group getParam */
def getBinary: Boolean = $(binary)
- setDefault(vocabSize -> (1 << 18), minDF -> 1.0, minTF -> 1.0, binary -> false)
+ setDefault(vocabSize -> (1 << 18),
+ minDF -> 1.0,
+ maxDF -> Long.MaxValue,
--- End diff --
Yeah, I get it. It didn't work that way before; it seems valuable only if you
can avoid the whole filter stage (when both values are set to filter nothing).
Even there, I wonder whether that makes any appreciable difference. I suppose
I would have just done the straightforward thing here, and so this change
looks OK to me.
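
To make the trade-off concrete, here is a minimal sketch in plain Scala collections, not the actual CountVectorizer code. The names `DfFilterSketch`, `resolve`, and `filterByDocFreq` are hypothetical, and the sketch assumes a threshold >= 1.0 is an absolute document count while a threshold < 1.0 is a fraction of the corpus. It skips the filter pass only when both bounds are permissive enough to keep every term:

```scala
object DfFilterSketch {
  // Hypothetical helper: a threshold >= 1.0 is read as an absolute document
  // count, a threshold < 1.0 as a fraction of the corpus (an assumption).
  def resolve(threshold: Double, numDocs: Long): Double =
    if (threshold >= 1.0) threshold else threshold * numDocs

  // Keeps the (term, docFreq) pairs whose document frequency lies in
  // [minDF, maxDF]. If the bounds cannot exclude any term, the input is
  // returned as-is and no filter pass runs.
  def filterByDocFreq(
      docFreqs: Seq[(String, Long)],
      numDocs: Long,
      minDF: Double = 1.0,
      maxDF: Double = Long.MaxValue.toDouble): Seq[(String, Long)] = {
    val lo = resolve(minDF, numDocs)
    val hi = resolve(maxDF, numDocs)
    // Every observed term appears in at least 1 and at most numDocs documents.
    val filtersNothing = lo <= 1.0 && hi >= numDocs
    if (filtersNothing) docFreqs
    else docFreqs.filter { case (_, df) => df >= lo && df <= hi }
  }
}
```

With the defaults shown in the diff (minDF of 1.0 and maxDF of Long.MaxValue), the short-circuit branch is taken; whether skipping that pass matters in practice is exactly the open question above.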