GitHub user BryanCutler commented on a diff in the pull request:
https://github.com/apache/spark/pull/20777#discussion_r174626899
--- Diff: python/pyspark/ml/tests.py ---
@@ -679,6 +679,29 @@ def test_count_vectorizer_with_binary(self):
feature, expected = r
self.assertEqual(feature, expected)
+ def test_count_vectorizer_with_maxDF(self):
+ dataset = self.spark.createDataFrame([
+ (0, "a b c d".split(' '), SparseVector(3, {0: 1.0, 1: 1.0, 2: 1.0}),),
+ (1, "a b c".split(' '), SparseVector(3, {0: 1.0, 1: 1.0}),),
+ (2, "a b".split(' '), SparseVector(3, {0: 1.0}),),
+ (3, "a".split(' '), SparseVector(3, {}),)], ["id", "words", "expected"])
+ cv = CountVectorizer(inputCol="words", outputCol="features")
+ model1 = cv.setMaxDF(3).fit(dataset)
--- End diff --
Actually, I still don't think setting the `maxDF` value changes anything
about the model in this test. You want the test to fail if the value is not
set to 3. To do that, I think you will also need to assert that the model's
vocabulary is equal to a specific expected value.
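
To illustrate why a vocabulary assertion would catch this, here is a rough pure-Python sketch using the same four documents as the test. `build_vocabulary` is a hypothetical helper, not Spark's implementation; it assumes `maxDF` drops any term that appears in more than that many documents, and it uses a toy ordering (descending document frequency, ties broken alphabetically).

```python
from collections import Counter


def build_vocabulary(docs, max_df=None):
    """Toy stand-in for vocabulary selection; NOT Spark's implementation."""
    # Document frequency: number of documents each term appears in.
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))
    # Assumption: terms appearing in more than max_df documents are dropped.
    kept = [t for t, n in doc_freq.items() if max_df is None or n <= max_df]
    # Toy ordering: most frequent first, ties broken alphabetically.
    return sorted(kept, key=lambda t: (-doc_freq[t], t))


docs = [
    "a b c d".split(' '),
    "a b c".split(' '),
    "a b".split(' '),
    "a".split(' '),
]

vocab_default = build_vocabulary(docs)            # no cap on document frequency
vocab_capped = build_vocabulary(docs, max_df=3)   # "a" is in 4 documents, so it is dropped

print(vocab_default)
print(vocab_capped)
```

Under these assumptions the two vocabularies differ only by the term `a`, so an assertion on the vocabulary would fail if `maxDF` were left unset. In the test itself, that would presumably mean comparing `model1.vocabulary` against the expected list of terms, assuming the fitted model returns them in a predictable order.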