This is an automated email from the ASF dual-hosted git repository.
ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new f48751527251 [SPARK-51567][ML][CONNECT] Fix `DistributedLDAModel.vocabSize`
f48751527251 is described below
commit f4875152725128486fc3d318a3abb90e40d9b8da
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Thu Mar 20 14:23:11 2025 +0800
[SPARK-51567][ML][CONNECT] Fix `DistributedLDAModel.vocabSize`
### What changes were proposed in this pull request?
Fix `DistributedLDAModel.vocabSize`
### Why are the changes needed?
```
pyspark.errors.exceptions.connect.SparkException:
[CONNECT_ML.ATTRIBUTE_NOT_ALLOWED] Generic Spark Connect ML error. vocabSize in
org.apache.spark.ml.clustering.DistributedLDAModel is not allowed to be
accessed. SQLSTATE: XX000
JVM stacktrace:
org.apache.spark.sql.connect.ml.MLAttributeNotAllowedException
    at org.apache.spark.sql.connect.ml.MLUtils$.validate(MLUtils.scala:686)
    at org.apache.spark.sql.connect.ml.MLUtils$.invokeMethodAllowed(MLUtils.scala:691)
    at org.apache.spark.sql.connect.ml.AttributeHelper.$anonfun$getAttribute$1(MLHandler.scala:56)
```
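For context (not part of the original patch): the Connect ML handler validates every invoked method against a per-class allowlist in `MLUtils.scala`, and a method is accepted when any allowlisted class matching the model's class hierarchy contains it. Before this fix, `vocabSize` was listed only under `LocalLDAModel`, so calls on a `DistributedLDAModel` were rejected even though the method is defined on the shared `LDAModel` base class. A minimal Python sketch of that lookup pattern (class names mirror the Scala ones, but the `ALLOWED` table and `is_allowed` helper are illustrative, not Spark's actual code):

```python
# Simplified stand-ins for the Scala model classes.
class LDAModel: ...
class LocalLDAModel(LDAModel): ...
class DistributedLDAModel(LDAModel): ...

# Per-class method allowlist, as it stood BEFORE this fix:
# "vocabSize" lives only under LocalLDAModel.
ALLOWED = [
    (LDAModel, {"isDistributed", "logLikelihood", "logPerplexity",
                "describeTopics"}),
    (LocalLDAModel, {"vocabSize"}),
    (DistributedLDAModel, {"trainingLogLikelihood", "logPrior",
                           "getCheckpointFiles", "toLocal"}),
]

def is_allowed(model_cls, method):
    # A method is allowed if some allowlisted class is the model's class
    # (or a superclass of it) and that class's set contains the method.
    return any(issubclass(model_cls, cls) and method in methods
               for cls, methods in ALLOWED)

# DistributedLDAModel is not a subclass of LocalLDAModel, so the lookup
# fails -> MLAttributeNotAllowedException, as in the stacktrace above.
assert not is_allowed(DistributedLDAModel, "vocabSize")
assert is_allowed(LocalLDAModel, "vocabSize")
```

The patch below resolves this by moving `"vocabSize"` into the base `LDAModel` entry, so the lookup succeeds for both subclasses.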
### Does this PR introduce _any_ user-facing change?
Yes, `DistributedLDAModel.vocabSize` is now supported over Spark Connect.
### How was this patch tested?
Added a test in `test_clustering.py`.
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #50330 from zhengruifeng/ml_connect_lda_vocabSize.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
---
python/pyspark/ml/tests/test_clustering.py | 1 +
.../src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala | 4 ++--
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/python/pyspark/ml/tests/test_clustering.py b/python/pyspark/ml/tests/test_clustering.py
index f89c7305fc9c..a35eaac10a7e 100644
--- a/python/pyspark/ml/tests/test_clustering.py
+++ b/python/pyspark/ml/tests/test_clustering.py
@@ -404,6 +404,7 @@ class ClusteringTestsMixin:
self.assertNotIsInstance(model, LocalLDAModel)
self.assertIsInstance(model, DistributedLDAModel)
self.assertTrue(model.isDistributed())
+ self.assertEqual(model.vocabSize(), 2)
dc = model.estimatedDocConcentration()
self.assertTrue(np.allclose(dc.toArray(), [26.0, 26.0], atol=1e-4), dc)
diff --git a/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala b/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala
index c11a153cde5b..9346074ed448 100644
--- a/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala
+++ b/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala
@@ -621,8 +621,8 @@ private[ml] object MLUtils {
"isDistributed",
"logLikelihood",
"logPerplexity",
- "describeTopics")),
- (classOf[LocalLDAModel], Set("vocabSize")),
+ "describeTopics",
+ "vocabSize")),
(
classOf[DistributedLDAModel],
Set("trainingLogLikelihood", "logPrior", "getCheckpointFiles",
"toLocal")),
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]