This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new f48751527251 [SPARK-51567][ML][CONNECT] Fix `DistributedLDAModel.vocabSize`
f48751527251 is described below

commit f4875152725128486fc3d318a3abb90e40d9b8da
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Thu Mar 20 14:23:11 2025 +0800

    [SPARK-51567][ML][CONNECT] Fix `DistributedLDAModel.vocabSize`
    
    ### What changes were proposed in this pull request?
    Allow `vocabSize` to be called on `DistributedLDAModel` over Spark Connect by moving `"vocabSize"` from the `LocalLDAModel`-specific allowlist to the shared LDA model allowlist in `MLUtils`.
    
    ### Why are the changes needed?
    ```
    pyspark.errors.exceptions.connect.SparkException: [CONNECT_ML.ATTRIBUTE_NOT_ALLOWED] Generic Spark Connect ML error. vocabSize in org.apache.spark.ml.clustering.DistributedLDAModel is not allowed to be accessed. SQLSTATE: XX000
    
    JVM stacktrace:
    org.apache.spark.sql.connect.ml.MLAttributeNotAllowedException
            at org.apache.spark.sql.connect.ml.MLUtils$.validate(MLUtils.scala:686)
            at org.apache.spark.sql.connect.ml.MLUtils$.invokeMethodAllowed(MLUtils.scala:691)
            at org.apache.spark.sql.connect.ml.AttributeHelper.$anonfun$getAttribute$1(MLHandler.scala:56)
    ```
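    
    For context, a minimal PySpark sketch (not part of this patch) that exercises the failing path over Spark Connect; the toy DataFrame and the active Connect session `spark` are illustrative assumptions:
    
    ```python
    from pyspark.ml.clustering import LDA, DistributedLDAModel
    from pyspark.ml.linalg import Vectors
    
    # Assumes `spark` is an active SparkSession backed by Spark Connect.
    # Tiny toy corpus; "features" is the default featuresCol for LDA.
    df = spark.createDataFrame(
        [(0, Vectors.dense([1.0, 1.0])), (1, Vectors.dense([2.0, 3.0]))],
        ["id", "features"],
    )
    
    # optimizer="em" yields a DistributedLDAModel rather than a LocalLDAModel.
    model = LDA(k=2, optimizer="em", seed=1).fit(df)
    assert isinstance(model, DistributedLDAModel)
    
    # Before this fix, this call raised CONNECT_ML.ATTRIBUTE_NOT_ALLOWED over
    # Spark Connect; with the fix it returns the vocabulary size (2 here).
    print(model.vocabSize())
    ```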
    
    ### Does this PR introduce _any_ user-facing change?
    Yes, `DistributedLDAModel.vocabSize` is now supported over Spark Connect.
    
    ### How was this patch tested?
    Added an assertion to the existing test in `python/pyspark/ml/tests/test_clustering.py`.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes #50330 from zhengruifeng/ml_connect_lda_vocabSize.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
---
 python/pyspark/ml/tests/test_clustering.py                            | 1 +
 .../src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala      | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/ml/tests/test_clustering.py b/python/pyspark/ml/tests/test_clustering.py
index f89c7305fc9c..a35eaac10a7e 100644
--- a/python/pyspark/ml/tests/test_clustering.py
+++ b/python/pyspark/ml/tests/test_clustering.py
@@ -404,6 +404,7 @@ class ClusteringTestsMixin:
         self.assertNotIsInstance(model, LocalLDAModel)
         self.assertIsInstance(model, DistributedLDAModel)
         self.assertTrue(model.isDistributed())
+        self.assertEqual(model.vocabSize(), 2)
 
         dc = model.estimatedDocConcentration()
         self.assertTrue(np.allclose(dc.toArray(), [26.0, 26.0], atol=1e-4), dc)
diff --git a/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala b/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala
index c11a153cde5b..9346074ed448 100644
--- a/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala
+++ b/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/ml/MLUtils.scala
@@ -621,8 +621,8 @@ private[ml] object MLUtils {
         "isDistributed",
         "logLikelihood",
         "logPerplexity",
-        "describeTopics")),
-    (classOf[LocalLDAModel], Set("vocabSize")),
+        "describeTopics",
+        "vocabSize")),
     (
       classOf[DistributedLDAModel],
       Set("trainingLogLikelihood", "logPrior", "getCheckpointFiles", 
"toLocal")),

