zhengruifeng opened a new pull request, #44526:
URL: https://github.com/apache/spark/pull/44526

   ### What changes were proposed in this pull request?
   The column references in `ALSModel.transform` may be ambiguous in some cases, causing the transform to fail with an `AnalysisException`. This PR fixes the column resolution so the ambiguity no longer occurs.
   
   ### Why are the changes needed?
   To fix a bug.
   
   Before this fix, the added test fails with:
   ```
   ======================================================================
   ERROR [5.597s]: test_ambiguous_column 
(pyspark.ml.tests.test_als.ALSTest.test_ambiguous_column)
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/ml/tests/test_als.py", 
line 47, in test_ambiguous_column
       predictions = loaded_model.transform(users.crossJoin(items))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/Users/ruifeng.zheng/Dev/spark/python/pyspark/ml/base.py", line 260, 
in transform
       return self._transform(dataset)
              ^^^^^^^^^^^^^^^^^^^^^^^^
   ...
   
       raise converted from None
   pyspark.errors.exceptions.captured.AnalysisException: Column features#50, 
features#46 are ambiguous. It's probably because you joined several Datasets 
together, and some of these Datasets are the same. This column points to one of 
the Datasets but Spark is unable to figure out which one. Please alias the 
Datasets with different names via `Dataset.as` before joining them, and specify 
the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > 
$"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false 
to disable this check.
   
   JVM stacktrace:
   org.apache.spark.sql.AnalysisException: Column features#50, features#46 are 
ambiguous. It's probably because you joined several Datasets together, and some 
of these Datasets are the same. This column points to one of the Datasets but 
Spark is unable to figure out which one. Please alias the Datasets with 
different names via `Dataset.as` before joining them, and specify the column 
using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > $"b.id")`. 
You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false to disable 
this check.
        at 
org.apache.spark.sql.errors.QueryCompilationErrors$.ambiguousAttributesInSelfJoinError(QueryCompilationErrors.scala:1998)
   ```
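   The failure can be reproduced with a small sketch along the lines of the test in the traceback above (the `users`/`items` names come from that test; the exact schema here is an assumption). `users` and `items` are both derived from the same ratings DataFrame, and `ALSModel.transform` internally joins the input against the model's factor DataFrames, which is what used to trip Spark's ambiguous self-join check:

   ```python
   from pyspark.sql import SparkSession
   from pyspark.ml.recommendation import ALS

   spark = SparkSession.builder.master("local[1]").getOrCreate()

   # Tiny synthetic ratings frame (illustrative data, not from the PR).
   ratings = spark.createDataFrame(
       [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0)],
       ["user", "item", "rating"],
   )
   model = ALS(
       userCol="user", itemCol="item", ratingCol="rating",
       rank=2, maxIter=1, seed=0,
   ).fit(ratings)

   # Both frames originate from the same Dataset lineage.
   users = ratings.select("user").distinct()
   items = ratings.select("item").distinct()

   # Before this fix, this raised:
   #   AnalysisException: Column features#..., features#... are ambiguous ...
   # With the fix, the transform resolves its internal column references
   # unambiguously and returns one prediction row per (user, item) pair.
   predictions = model.transform(users.crossJoin(items))
   ```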
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, this is a bug fix: `ALSModel.transform` no longer fails with an ambiguous-column `AnalysisException` in the scenario above.
   
   ### How was this patch tested?
   Added a unit test (`test_ambiguous_column` in `pyspark.ml.tests.test_als`).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

