GitHub user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20777#discussion_r174626899
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -679,6 +679,29 @@ def test_count_vectorizer_with_binary(self):
                 feature, expected = r
                 self.assertEqual(feature, expected)
     
    +    def test_count_vectorizer_with_maxDF(self):
    +        dataset = self.spark.createDataFrame([
    +            (0, "a b c d".split(' '), SparseVector(3, {0: 1.0, 1: 1.0, 2: 1.0}),),
    +            (1, "a b c".split(' '), SparseVector(3, {0: 1.0, 1: 1.0}),),
    +            (2, "a b".split(' '), SparseVector(3, {0: 1.0}),),
    +            (3, "a".split(' '), SparseVector(3,  {}),)], ["id", "words", "expected"])
    +        cv = CountVectorizer(inputCol="words", outputCol="features")
    +        model1 = cv.setMaxDF(3).fit(dataset)
    --- End diff --
    
    Actually, I still don't think setting the `maxDF` value changes the model 
in this test.  The test should fail if you do not set the value to 3.  To get 
that, I think you will also need to assert that the vocabulary is equal to an 
expected value.
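
As a rough illustration of why a vocabulary assertion would catch this, here is a plain-Python sketch of a document-frequency cap applied to the same four documents. It is not Spark's implementation: `vocab_with_max_df` is a hypothetical helper, and it assumes a term is kept when its document frequency is at most the cap.

```python
from collections import Counter


def vocab_with_max_df(docs, max_df=None):
    """Hypothetical helper: keep terms whose document frequency is <= max_df."""
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # With no cap, every term is kept.
    return {t for t, n in doc_freq.items() if max_df is None or n <= max_df}


# Same word lists as the dataset in the test above.
docs = [
    "a b c d".split(' '),
    "a b c".split(' '),
    "a b".split(' '),
    "a".split(' '),
]

capped = vocab_with_max_df(docs, max_df=3)   # "a" occurs in all 4 documents
uncapped = vocab_with_max_df(docs)           # no cap applied
```

Under that assumption the capped vocabulary drops "a" and the uncapped one keeps it, so the two cases are distinguishable. In the real test, the analogous check would presumably compare `model1.vocabulary` against the expected terms.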


---
