[ 
https://issues.apache.org/jira/browse/SPARK-38039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Lapshin updated SPARK-38039:
-----------------------------------
    Description: 
Consider this code in Scala:
{code:scala}
import org.apache.spark.sql.functions.{col, exists, filter}

case class DemoSubRow(a: Int, b: Array[Int])
case class DemoRow(elems: Array[DemoSubRow])

// keep only the elements whose array "b" contains their own "a"
dataFrame.withColumn(
  "goodElems",
  filter(col("elems"), x => exists(x.getField("b"), y => x.getField("a") === y)))
{code}
One could expect that it would be equivalent to
{code:sql}
SELECT *, filter(elems, x -> exists(x.b, y -> x.a == y)) AS goodElems
    FROM dataFrame
{code}
However, it's not. If you look into the {{org.apache.spark.sql.functions}} object, 
you'll see this private method:
{code:scala}
private def createLambda(f: Column => Column) = {
  val x = UnresolvedNamedLambdaVariable(Seq("x"))
  val function = f(Column(x)).expr
  LambdaFunction(function, Seq(x))
}
{code}
If you look closely, you'll see that the column passed into the lambda is 
always the *unresolved* variable {{{}x{}}}. Because of that, the column from the 
Scala sample above is seen as:
{code:sql}
… filter(elems, x -> exists(x.b, x -> x.a == x)) AS goodElems …
                                 ^^^^^^^^^^^^^
{code}
which is obviously wrong: the inner lambda's variable shadows the outer one. In 
the given example this produces an {_}AnalysisException{_}, but it could also 
silently execute wrong code (e.g., if the dataframe happened to have a column 
named _x_).

My current workaround is a reflection hack to call {{{}functions.withExpr{}}}, 
but it's a really bad one.
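
For reference, a reflection-free way to sidestep the problem (just a sketch of an alternative workaround, not something taken from Spark's docs) is to build the whole nested higher-order expression from a SQL string via {{functions.expr}}, since the SQL parser binds each lambda parameter under its own name:
{code:scala}
// Workaround sketch: express the nested higher-order functions as SQL,
// so each lambda parameter keeps its own name and nothing is shadowed.
import org.apache.spark.sql.functions.expr

dataFrame.withColumn(
  "goodElems",
  expr("filter(elems, x -> exists(x.b, y -> x.a == y))"))
{code}
This produces exactly the SQL form shown above, at the cost of losing compile-time checking of the column expressions.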

What should probably be done is to replace the hard-coded name {{x}} with a 
generated unique variable name, or even with proper locally bound, 
already-resolved variables (at the moment the lambda is created, the variable 
could be considered resolved). One open concern is how such a generated name 
would be displayed to an end user in case of an analysis error; at the time of 
reporting this issue I have no good idea how to solve that.
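
To illustrate the unique-name idea, here is a minimal sketch of what {{createLambda}} could look like (hypothetical code, not Spark's actual implementation; the counter and the {{x_N}} naming scheme are made up here):
{code:scala}
// Hypothetical sketch: give every lambda variable a unique name so that
// nested createLambda calls cannot shadow each other.
import java.util.concurrent.atomic.AtomicLong

private val lambdaVarId = new AtomicLong(0L)

private def createLambda(f: Column => Column) = {
  val name = s"x_${lambdaVarId.getAndIncrement()}"   // e.g. x_0, x_1, ...
  val v = UnresolvedNamedLambdaVariable(Seq(name))
  val function = f(Column(v)).expr
  LambdaFunction(function, Seq(v))
}
{code}
Generated names like these would still need to be rendered sensibly in analysis error messages, which is the open concern mentioned above.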



> From Scala/Java API, higher-order functions (like exists) produce wrong 
> expressions when nested
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-38039
>                 URL: https://issues.apache.org/jira/browse/SPARK-38039
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API, SQL
>    Affects Versions: 3.1.1
>            Reporter: Dmitry Lapshin
>            Priority: Major
>


