Github user mgaido91 commented on a diff in the pull request:
https://github.com/apache/spark/pull/20520#discussion_r167081534
--- Diff: python/pyspark/ml/tests.py ---
@@ -1620,6 +1621,23 @@ def test_kmeans_summary(self):
         self.assertEqual(s.k, 2)
 
+class KMeansTests(SparkSessionTestCase):
+
+    def test_kmeans_cosine_distance(self):
+        data = [(Vectors.dense([1.0, 1.0]),), (Vectors.dense([10.0, 10.0]),),
+                (Vectors.dense([1.0, 0.5]),), (Vectors.dense([10.0, 4.4]),),
+                (Vectors.dense([-1.0, 1.0]),), (Vectors.dense([-100.0, 90.0]),)]
+        df = self.spark.createDataFrame(data, ["features"])
+        kmeans = KMeans(k=3, seed=1)
+        kmeans.setDistanceMeasure("cosine")
+        model = kmeans.fit(df)
+        result = model.transform(df).rdd.collectAsMap()
+        self.assertTrue(result[Vectors.dense([1.0, 1.0])] == result[Vectors.dense([10.0, 10.0])])
--- End diff --
Sorry, but I don't quite follow what you have in mind. Here I check that the
predictions for two points are the same value. Since that value is not known
in advance, I cannot simply check that the DataFrame is equal to something
predefined. Am I missing something?
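
To illustrate the point, here is a minimal sketch in plain Python (no Spark involved). The two prediction maps are hypothetical, not output from the test above, and `same_cluster` and `co_membership` are illustrative helper names that do not exist in PySpark. It shows why comparing against a predefined label fails when labels are arbitrary, while comparing co-membership still works:

```python
# Hypothetical predictions keyed by point. The cluster ids are arbitrary:
# two runs can produce the same grouping with different ids.
run_a = {(1.0, 1.0): 0, (10.0, 10.0): 0,
         (1.0, 0.5): 1, (10.0, 4.4): 1,
         (-1.0, 1.0): 2, (-100.0, 90.0): 2}
run_b = {(1.0, 1.0): 2, (10.0, 10.0): 2,
         (1.0, 0.5): 0, (10.0, 4.4): 0,
         (-1.0, 1.0): 1, (-100.0, 90.0): 1}


def same_cluster(predictions, a, b):
    """Return True if points a and b were assigned the same cluster id."""
    return predictions[a] == predictions[b]


def co_membership(predictions):
    """Return the grouping of points as a set of frozensets, ignoring ids."""
    groups = {}
    for point, label in predictions.items():
        groups.setdefault(label, set()).add(point)
    return {frozenset(group) for group in groups.values()}


# Comparing raw labels against a fixed expectation is fragile:
labels_match = run_a == run_b

# Comparing which points share a cluster does not depend on the ids:
pair_ok = same_cluster(run_a, (1.0, 1.0), (10.0, 10.0)) and \
    same_cluster(run_b, (1.0, 1.0), (10.0, 10.0))
groupings_match = co_membership(run_a) == co_membership(run_b)
```

If a stronger check than a single pair is wanted, comparing the full co-membership sets is one option, under the assumption that the expected grouping is known even though the ids are not.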
---