[
https://issues.apache.org/jira/browse/SPARK-26996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16778246#comment-16778246
]
Dongjoon Hyun commented on SPARK-26996:
---------------------------------------
Thank you for reporting, [~ilya745]. I also confirmed that the given example
works on Spark 2.3.3 and fails on Spark 2.4.0.
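For anyone hitting this on 2.4.0, a possible workaround (a sketch against the reproducer's view names, not a fix) is to evaluate the scalar subquery separately and inline the result as a literal, so no scalar-subquery expression reaches the partition filter:

```scala
// Workaround sketch: materialize the scalar value first, then
// substitute it into the main query as a plain string literal.
// Assumes the reproducer's views "latest_dates" and "source1" exist.
val latest = spark.sql(
  "select latest_date from latest_dates where source = 'source1'"
).collect()(0).getString(0)

// The filter now compares against a literal, avoiding the
// "Cannot evaluate expression: scalar-subquery" path.
spark.sql(
  s"select max(date), 'source1' as category from source1 where date >= '$latest'"
).show()
```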
> Scalar Subquery not handled properly in Spark 2.4
> --------------------------------------------------
>
> Key: SPARK-26996
> URL: https://issues.apache.org/jira/browse/SPARK-26996
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Ilya Peysakhov
> Priority: Critical
>
> Spark 2.4 reports an error when a query filters on the result of a subquery
> that returns only 1 row and 1 column (a scalar subquery).
>
> Reproducer is below. No other data is needed to reproduce the error.
> This will write a table of dates and strings, write another "fact" table of
> ints partitioned by date, then register both as temporary views and filter
> the "fact" view on a scalar subquery against the first table. This is done
> within spark-shell in vanilla Spark 2.4 (also reproduced on AWS EMR 5.20.0).
> -------------------------
> spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL select '2018-01-02', 'source2' UNION ALL select '2018-01-03', 'source3' UNION ALL select '2018-01-04', 'source4'").write.mode("overwrite").save("/latest_dates")
> val mydatetable = spark.read.load("/latest_dates")
> mydatetable.createOrReplaceTempView("latest_dates")
> spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, '2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, '2018-01-01' UNION ALL select 600, '2018-08-30'").write.mode("overwrite").partitionBy("date").save("/mypartitioneddata")
> val source1 = spark.read.load("/mypartitioneddata")
> source1.createOrReplaceTempView("source1")
> spark.sql("select max(date), 'source1' as category from source1 where date >= (select latest_date from latest_dates where source='source1')").show
> ----------------------------
>
> Error summary
> -------
> java.lang.UnsupportedOperationException: Cannot evaluate expression:
> scalar-subquery#35 []
> at
> org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
> at
> org.apache.spark.sql.catalyst.expressions.ScalarSubquery.eval(subquery.scala:246)
> -------
> This reproducer works in previous versions (2.3.2, 2.3.1, etc.).
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]