[ https://issues.apache.org/jira/browse/SPARK-26777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16771304#comment-16771304 ]

Ilya Peysakhov commented on SPARK-26777:
----------------------------------------

[~hyukjin.kwon]
[~yuri.budilov]
[~kabhwan]

Hello folks,

I went ahead and ran this with vanilla Scala Spark 2.4 on Windows, EMR 5.20
(Scala, Spark 2.4), and EMR 5.17 (Scala, Spark 2.3.1), and have replicated the
issue.

Here is the code:

spark.sql("select '2018-01-01' as latest_date, 'source1' as source UNION ALL 
select '2018-01-02', 'source2' UNION ALL select '2018-01-03' , 'source3' UNION 
ALL select '2018-01-04' ,'source4' ").write.save("/latest_dates")
val mydatetable = spark.read.load("/latest_dates")
mydatetable.createOrReplaceTempView("latest_dates")


spark.sql("select 50 as mysum, '2018-01-01' as date UNION ALL select 100, 
'2018-01-02' UNION ALL select 300, '2018-01-03' UNION ALL select 3444, 
'2018-01-01' UNION ALL select 600, '2018-08-30' 
").write.partitionBy("date").save("/mypartitioneddata")
val source1 = spark.read.load("/mypartitioneddata")
source1.createOrReplaceTempView("source1")


spark.sql("select max(date), 'source1' as category from source1 where date >= 
(select latest_date from latest_dates where source='source1') ").show

 

The error comes up in vanilla Spark 2.4 on Windows and EMR 5.20, but not in
vanilla Spark 2.3.1 or EMR 5.17, so this is definitely not an AWS/EMR issue.
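
As a side note (just a sketch of a possible workaround, not something from the
original report or verified on all of the environments above): rewriting the
scalar subquery as a join avoids the scalar-subquery expression entirely, with
the caveat that the semantics differ slightly if latest_dates ever held more
than one row per source.

// Hypothetical workaround sketch: express the "latest date" lookup as a join
// instead of a scalar subquery, using the same views created above.
spark.sql("""
  select max(s.date), 'source1' as category
  from source1 s
  join latest_dates l
    on l.source = 'source1' and s.date >= l.latest_date
""").show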

 

Let me know if you need more details. Should I post a new issue?

I did notice that if the scalar subquery runs against hardcoded values, the
error does not occur; it only happens when the subquery reads from an existing
table (as in the method above, where the data is read back into a view, or on
AWS when querying through the Glue catalog). A quick sketch of the
hardcoded-values case is below.
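
For comparison, here is a rough sketch of that working case (the inline view
name below is just illustrative, not from the report): the scalar subquery
targets a temp view built directly from literal values rather than a table
read back from storage, and the same outer query completes without the error.

// Hypothetical contrast case: latest_dates_inline is built from literal
// values only and is never written to or read back from storage.
spark.sql("select '2018-01-01' as latest_date, 'source1' as source")
  .createOrReplaceTempView("latest_dates_inline")

// Same outer query as above, but the scalar subquery reads the inline view.
spark.sql("""
  select max(date), 'source1' as category
  from source1
  where date >= (select latest_date from latest_dates_inline where source = 'source1')
""").show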

 

> SQL worked in 2.3.2 and fails in 2.4.0
> --------------------------------------
>
>                 Key: SPARK-26777
>                 URL: https://issues.apache.org/jira/browse/SPARK-26777
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.0
>            Reporter: Yuri Budilov
>            Priority: Major
>
> The following SQL worked in Spark 2.3.2 and now fails on 2.4.0 (AWS EMR Spark).
> ------------ PySpark call below:
> spark.sql("select partition_year_utc, partition_month_utc, partition_day_utc \
> from datalake_reporting.copy_of_leads_notification \
> where partition_year_utc = \
>   (select max(partition_year_utc) from datalake_reporting.copy_of_leads_notification) \
> and partition_month_utc = \
>   (select max(partition_month_utc) from datalake_reporting.copy_of_leads_notification as m \
>    where m.partition_year_utc = \
>      (select max(partition_year_utc) from datalake_reporting.copy_of_leads_notification)) \
> and partition_day_utc = \
>   (select max(d.partition_day_utc) from datalake_reporting.copy_of_leads_notification as d \
>    where d.partition_month_utc = \
>      (select max(m1.partition_month_utc) from datalake_reporting.copy_of_leads_notification as m1 \
>       where m1.partition_year_utc = \
>         (select max(y.partition_year_utc) from datalake_reporting.copy_of_leads_notification as y))) \
> order by 1 desc, 2 desc, 3 desc limit 1 ").show(1,False)
> Error (no data is needed to reproduce this; the query itself fails):
> py4j.protocol.Py4JJavaError: An error occurred while calling o1326.showString.
> : java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> scalar-subquery#4495 []
>  
> Note: all 3 columns in the query are partitioned columns (see the bottom of 
> the schema).
>  
> The AWS EMR Hive schema is:
>  
> CREATE EXTERNAL TABLE `copy_of_leads_notification`(
> `message.environment.siteorigin` string, `dcpheader.dcploaddateutc` string, 
> `message.id` int, `source.properties._country` string, `message.created` 
> string, `dcpheader.generatedmessageid` string, `message.tags` bigint, 
> `source.properties._enqueuedtimeutc` string, `source.properties._leadtype` 
> string, `message.itemid` string, `message.prospect.postcode` string, 
> `message.prospect.email` string, `message.referenceid` string, 
> `message.item.year` string, `message.identifier` string, 
> `dcpheader.dcploadmonthutc` string, `message.processed` string, 
> `source.properties._tenant` string, `message.item.price` string, 
> `message.subscription.confirmresponse` boolean, `message.itemtype` string, 
> `message.prospect.lastname` string, `message.subscription.insurancequote` 
> boolean, `source.exchangename` string, 
> `message.prospect.identificationnumbers` bigint, 
> `message.environment.ipaddress` string, `dcpheader.dcploaddayutc` string, 
> `source.properties._itemtype` string, `source.properties._requesttype` 
> string, `message.item.make` string, `message.prospect.firstname` string, 
> `message.subscription.survey` boolean, `message.prospect.homephone` string, 
> `message.extendedproperties` bigint, `message.subscription.financequote` 
> boolean, `message.uniqueidentifier` string, `source.properties._id` string, 
> `dcpheader.sourcemessageguid` string, `message.requesttype` string, 
> `source.routingkey` string, `message.service` string, `message.item.model` 
> string, `message.environment.pagesource` string, `source.source` string, 
> `message.sellerid` string, `partition_date_utc` string, 
> `message.selleridentifier` string, `message.subscription.newsletter` boolean, 
> `dcpheader.dcploadyearutc` string, `message.leadtype` string, 
> `message.history` bigint, `message.callconnect.calloutcome` string, 
> `message.callconnect.datecreatedutc` string, 
> `message.callconnect.callrecordingurl` string, 
> `message.callconnect.transferoutcome` string, 
> `message.callconnect.hiderecording` boolean, 
> `message.callconnect.callstartutc` string, `message.callconnect.code` string, 
> `message.callconnect.callduration` string, `message.fraudnetinfo` string, 
> `message.callconnect.answernumber` string, `message.environment.sourcedevice` 
> string, `message.comments` string, `message.fraudinfo.servervariables` 
> bigint, `message.callconnect.servicenumber` string, 
> `message.callconnect.callid` string, `message.callconnect.voicemailurl` 
> string, `message.item.stocknumber` string, 
> `message.callconnect.answerduration` string, `message.callconnect.callendutc` 
> string, `message.item.series` string, `message.item.detailsurl` string, 
> `message.item.pricetype` string, `message.item.description` string, 
> `message.item.colour` string, `message.item.badge` string, 
> `message.item.odometer` string, `message.environment.requestheader` string, 
> `message.item.registrationnumber` string, `message.item.bodytype` string, 
> `message.item.fueltype` string, `message.item.redbookcode` string, 
> `message.item.spotid` string, `message.item.id` string, 
> `message.item.transmission` string, `message.item.vin` string, 
> `message.item.enginedescription` string, `message.prospect.mobilephone` 
> string, `message.prospect.membertrackingid` string, 
> `message.environment.username` string, `message.prospect.workphone` string, 
> `message.environment.servername` string, `message.environment.sessionid` 
> string, `message.tradein.type` string, `message.tradein.model` string, 
> `message.tradein.year` string, `message.tradein.make` string, 
> `message.tradein.kms` string, `message.fraudinfo.servertimestamp` string, 
> `message.prospect.suburb` string, `message.callconnect.username` string, 
> `message.callconnect.password` string, `message.status` string, 
> `message.tradein.colour` string, `message.prospect.address` string, 
> `message.prospect.state` string, `message.tradein.detailsurl` string, 
> `message.prospect.faxnumber` string, `message.prospect.companyname` string, 
> `message.prospect.title` string, `message.callconnect.calloutcometext` 
> string, `message.prospect.preferredcontacttime` string, 
> `message.fraudinfo.devicedata` string, 
> `message.prospect.preferredcontactmethod` string, 
> `message.assignment.assigned` string, `message.assignment.email` string, 
> `message.assignment.name` string, `message.country` string, 
> `message.financepackage.id` string, `message.financepackage.version` string)
> PARTITIONED BY ( `partition_year_utc` string, `partition_month_utc` string, 
> `partition_day_utc` string, `job_run_guid` string)
> ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION 's3://datalake/yurib_test/leads_notification'
>  


