[ 
https://issues.apache.org/jira/browse/SPARK-50837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056498#comment-18056498
 ] 

Varun Lakhiyani edited comment on SPARK-50837 at 2/4/26 5:13 PM:
-----------------------------------------------------------------

The behavior appears to originate in
{code:java}
src/main/scala/org/apache/spark/sql/connect/KeyValueGroupedDataset.scala {code}
inside
{code:java}
aggUntypedWithValueMapFunc() (around L507).
{code}
In this code path, the grouping expression changes from using an 
UNRESOLVED_STAR to an UNRESOLVED_ATTRIBUTE("iv.key"), resulting in the grouping 
UDF being serialized as:
{code:java}
common_inline_user_defined_function {
  deterministic: true
  arguments {
    unresolved_attribute {
      unparsed_identifier: "iv.key"
    }
  }
}{code}
This causes 
{code:java}
SparkConnectPlanner.isTypedScalaUdfExpr(...)
(src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:2591){code}
to return false, since it currently requires:
{code:java}
udf.getArguments(0).getExprTypeCase ==
  proto.Expression.ExprTypeCase.UNRESOLVED_STAR {code}
Once the argument is an UNRESOLVED_ATTRIBUTE, the typed Scala UDF is no longer 
detected. It is unclear how this detection can be made robust, since any 
expression or marker introduced in aggUntypedWithValueMapFunc() to identify 
KeyValueGroupedDataset internals can be reproduced by user-defined expressions, 
leading to another issue

P.S. Open to any suggestions or alternative approaches to resolve this issue.

[~viirya] , [~cloud_fan] (SPARK-26085, PR-23054), and [~xupaddy] (SPARK-43415, 
PR-49111) - if you get a chance to look at this, I would really appreciate any 
help.


was (Author: JIRAUSER312211):
The behavior appears to originate in
{code:java}
src/main/scala/org/apache/spark/sql/connect/KeyValueGroupedDataset.scala {code}
inside
{code:java}
aggUntypedWithValueMapFunc() (around L507).
{code}
In this code path, the grouping expression changes from using an 
UNRESOLVED_STAR to an UNRESOLVED_ATTRIBUTE("iv.key"), resulting in the grouping 
UDF being serialized as:
{code:java}
common_inline_user_defined_function {
  deterministic: true
  arguments {
    unresolved_attribute {
      unparsed_identifier: "iv.key"
    }
  }
}{code}
This causes 
{code:java}
SparkConnectPlanner.isTypedScalaUdfExpr(...)
(src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:2591){code}
to return false, since it currently requires:
{code:java}
udf.getArguments(0).getExprTypeCase ==
  proto.Expression.ExprTypeCase.UNRESOLVED_STAR {code}
Once the argument is an UNRESOLVED_ATTRIBUTE, the typed Scala UDF is no longer 
detected. It is unclear how this detection can be made robust, since any 
expression or marker introduced in aggUntypedWithValueMapFunc() to identify 
KeyValueGroupedDataset internals can be reproduced by user-defined expressions, 
leading to another issue

P.S. Open to any suggestions or alternative approaches to resolve this issue.

> Key attribute of a primitive type of typed aggregation should be named as 
> "key"
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-50837
>                 URL: https://issues.apache.org/jira/browse/SPARK-50837
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 4.0.0
>            Reporter: Pengfei Xu
>            Priority: Major
>
> SPARK-26085 is broken again. We need to do the column name restoration better.
> One idea to fix this is to implement special Tuple1 and Tuple2 encoders:
> * Tuple1 has a field called "value".
> * Tuple2 has two fields called "key" and "value".
> and use them for primitive types.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to