[GitHub] spark pull request #16770: [SPARK-15009][PYTHON][ML] Construct a CountVector...

BryanCutler Wed, 14 Mar 2018 12:02:01 -0700

Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16770#discussion_r174576552
  
    --- Diff: python/pyspark/ml/tests.py ---
    @@ -640,6 +640,33 @@ def test_count_vectorizer_with_binary(self):
                 feature, expected = r
                 self.assertEqual(feature, expected)
     
    +    def test_count_vectorizer_from_vocab(self):
    +        model = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words",
    +                                                     outputCol="features", minTF=2)
    +        self.assertEqual(model.vocabulary, ["a", "b", "c"])
    +        self.assertEqual(model.getMinTF(), 2)
    +
    +        dataset = self.spark.createDataFrame([
    +            (0, "a a a b b c".split(' '), SparseVector(3, {0: 3.0, 1: 2.0}),),
    +            (1, "a a".split(' '), SparseVector(3, {0: 2.0}),),
    +            (2, "a b".split(' '), SparseVector(3, {}),)], ["id", "words", "expected"])
    +
    +        transformed_list = model.transform(dataset).select("features", "expected").collect()
    +
    +        for r in transformed_list:
    +            feature, expected = r
    +            self.assertEqual(feature, expected)
    +
    +        # Test an empty vocabulary
    +        with QuietTest(self.sc):
    +            with self.assertRaisesRegexp(Exception, "vocabSize.*invalid.*0"):
    +                CountVectorizerModel.from_vocabulary([], inputCol="words")
    +
    +        # Test model with default settings can transform
    +        model_default = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words")
    +        transformed_list = model_default.transform(dataset).collect()
    +        self.assertEqual(len(transformed_list), 3)
    --- End diff --
    
    The doctest uses default values for all params except `outputCol` and 
checks the transformed values, so this is really just testing that nothing 
fails when every param, including `outputCol`, is left at its default.
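
    For reference, the expected vectors in the first part of the test follow from `minTF=2`. Below is a minimal pure-Python sketch of that filtering, assuming that a `minTF` of 1 or more is an absolute per-document count threshold and a value below 1 is a fraction of the document's token count. `count_vector` is a hypothetical helper written for illustration, not part of the PySpark API, and it returns a plain dict rather than a `SparseVector`.

```python
from collections import Counter


def count_vector(tokens, vocabulary, min_tf=1.0):
    """Hypothetical stand-in for transforming one row with a fixed vocabulary.

    Returns {vocabulary index: count} for terms whose count in `tokens`
    reaches the threshold; every other term is dropped.
    """
    index = {term: i for i, term in enumerate(vocabulary)}
    # Count only tokens that appear in the vocabulary.
    counts = Counter(t for t in tokens if t in index)
    # Assumption: min_tf >= 1 is an absolute count, otherwise a fraction
    # of the number of tokens in this document.
    threshold = min_tf if min_tf >= 1.0 else min_tf * len(tokens)
    return {index[t]: float(c) for t, c in counts.items() if c >= threshold}


vocab = ["a", "b", "c"]
row0 = count_vector("a a a b b c".split(' '), vocab, min_tf=2)  # "c" occurs once, so it is dropped
row1 = count_vector("a a".split(' '), vocab, min_tf=2)
row2 = count_vector("a b".split(' '), vocab, min_tf=2)          # no term reaches 2
```

    Under the same assumption, calling the helper with the default `min_tf` keeps every vocabulary term that occurs at least once, which is why the default-settings case in the test only checks that three rows come back.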


