[
https://issues.apache.org/jira/browse/SPARK-50837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056498#comment-18056498
]
Varun Lakhiyani edited comment on SPARK-50837 at 2/4/26 4:59 PM:
-----------------------------------------------------------------
The behavior appears to originate in
{code:java}
src/main/scala/org/apache/spark/sql/connect/KeyValueGroupedDataset.scala {code}
inside
{code:java}
aggUntypedWithValueMapFunc() (around L507).
{code}
In this code path, the grouping expression changes from using an
UNRESOLVED_STAR to an UNRESOLVED_ATTRIBUTE("iv.key"), resulting in the grouping
UDF being serialized as:
{code:java}
common_inline_user_defined_function {
deterministic: true
arguments {
unresolved_attribute {
unparsed_identifier: "iv.key"
}
}
}{code}
This causes
{code:java}
SparkConnectPlanner.isTypedScalaUdfExpr(...)
(src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:2591){code}
to return false, since it currently requires:
{code:java}
udf.getArguments(0).getExprTypeCase ==
proto.Expression.ExprTypeCase.UNRESOLVED_STAR {code}
Once the argument is an UNRESOLVED_ATTRIBUTE, the typed Scala UDF is no longer
detected. It is unclear how this detection can be made robust, since any
expression or marker introduced in aggUntypedWithValueMapFunc() to identify
KeyValueGroupedDataset internals can be reproduced by user-defined expressions,
leading to another issue
P.S. Open to any suggestions or alternative approaches to resolve this issue.
was (Author: JIRAUSER312211):
The behavior appears to originate in
{code:java}
src/main/scala/org/apache/spark/sql/connect/KeyValueGroupedDataset.scala {code}
inside
{code:java}
aggUntypedWithValueMapFunc() (around L507).
{code}
In this code path, the grouping expression changes from using an
UNRESOLVED_STAR to an UNRESOLVED_ATTRIBUTE("iv.key"), resulting in the grouping
UDF being serialized as:
{code:java}
common_inline_user_defined_function {
deterministic: true
arguments {
unresolved_attribute {
unparsed_identifier: "iv.key"
}
}
}{code}
This causes
{code:java}
SparkConnectPlanner.isTypedScalaUdfExpr(...)
(src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:2591){code}
to return false, since it currently requires:
{code:java}
udf.getArguments(0).getExprTypeCase ==
proto.Expression.ExprTypeCase.UNRESOLVED_STAR {code}
Once the argument is an UNRESOLVED_ATTRIBUTE, the typed Scala UDF is no longer
detected. It is unclear how this detection can be made robust, since any
expression or marker introduced in aggUntypedWithValueMapFunc() to identify
KeyValueGroupedDataset internals can be reproduced by user-defined expressions,
leading to another issue
> Key attribute of a primitive type of typed aggregation should be named as
> "key"
> -------------------------------------------------------------------------------
>
> Key: SPARK-50837
> URL: https://issues.apache.org/jira/browse/SPARK-50837
> Project: Spark
> Issue Type: Bug
> Components: Connect
> Affects Versions: 4.0.0
> Reporter: Pengfei Xu
> Priority: Major
>
> SPARK-26085 is broken again. We need to do the column name restoration better.
> One idea to fix this is to implement special Tuple1 and Tuple2 encoders:
> * Tuple1 has a field called "value".
> * Tuple2 has two fields called "key" and "value".
> and use them for primitive types.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]