[
https://issues.apache.org/jira/browse/SPARK-24177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-24177:
---------------------------------
Labels: bulk-closed (was: )
> Spark returning inconsistent rows and data in a join query when run using
> Spark SQL (using SQLContext.sql(...))
> ---------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-24177
> URL: https://issues.apache.org/jira/browse/SPARK-24177
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.0
> Environment: Production
> Reporter: Ajay Monga
> Priority: Major
> Labels: bulk-closed
>
> Spark SQL is returning inconsistent result for a JOIN query. It returns
> different rows and the value of the column on which a simple multiplication
> takes place returns different values:
> The query is like:
> SELECT
> second_table.date_value, SUM(XXX * second_table.shift_value)
> FROM
> (
> SELECT
> date_value, SUM(value) as XXX
> FROM first_table
> WHERE
> AND date IN ( '2018-01-01', '2018-01-02' )
> GROUP BY date_value
> )
> intermediate LEFT OUTER
> JOIN second_table ON second_table.date_value = (<Logic to change the
> 'date_value' from first table, say if it's a Saturday or Sunday then use
> Monday, else next valid working date>)
> AND second_table.date_value IN (
> '2018-01-02',
> '2018-01-03'
> )
> GROUP BY second_table.date_value
>
> Suspicion is that, the execution of above query is split into two queries -
> one for first_table and other for second_table before joining. Then the
> results get split across partitions, seemingly grouped/distributed by the
> join column, which is 'date_value'. In the join there is a date shift logic
> that fails to join in some cases when it should, primarily for the
> date_values at the edge of the partitions distributed across the executors.
> So, the execution is dependent on how the data (or the rdd) of the individual
> queries is partitioned in the first place, which is not ideal as a normal
> looking ANSI standard SQL query is not behaving consistently.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]