[ https://issues.apache.org/jira/browse/SPARK-24177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463494#comment-16463494 ]

Ajay Monga commented on SPARK-24177:
------------------------------------

This suspicion is strengthened by the fact that, when the query is rewritten so 
that the date-shift logic is moved into the SELECT clause and the join is done 
afterwards on the resulting column, the result is correct and consistent across 
runs.
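A minimal, runnable sketch of that rewrite, using SQLite through Python's sqlite3 module as a stand-in for Spark SQL (the table contents and the next-business-day rule below are assumptions made for illustration, based on the issue description). The shift is computed inside the sub-query's SELECT, so the outer join compares two plain columns instead of evaluating an expression inside the ON clause:

```python
import datetime
import sqlite3

def next_business_day(date_str: str) -> str:
    """Hypothetical stand-in for the issue's date-shift logic:
    move to the next day, skipping Saturdays and Sundays."""
    d = datetime.date.fromisoformat(date_str) + datetime.timedelta(days=1)
    while d.weekday() >= 5:          # 5 = Saturday, 6 = Sunday
        d += datetime.timedelta(days=1)
    return d.isoformat()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE first_table  (date_value TEXT, value REAL);
    CREATE TABLE second_table (date_value TEXT, shift_value REAL);
    INSERT INTO first_table  VALUES ('2018-01-01', 10.0), ('2018-01-02', 20.0);
    INSERT INTO second_table VALUES ('2018-01-02', 2.0),  ('2018-01-03', 3.0);
""")
conn.create_function("shift_date", 1, next_business_day)

# Date shift applied in the inner SELECT; the join key is then an
# ordinary column-to-column equality.
rows = conn.execute("""
    SELECT second_table.date_value,
           SUM(intermediate.xxx * second_table.shift_value)
    FROM (
        SELECT shift_date(date_value) AS shifted_date, SUM(value) AS xxx
        FROM first_table
        WHERE date_value IN ('2018-01-01', '2018-01-02')
        GROUP BY shifted_date
    ) intermediate
    LEFT OUTER JOIN second_table
      ON second_table.date_value = intermediate.shifted_date
    GROUP BY second_table.date_value
    ORDER BY second_table.date_value
""").fetchall()
print(rows)   # [('2018-01-02', 20.0), ('2018-01-03', 60.0)]
```

In SQLite this is purely a readability change, but it mirrors the reported workaround: once the shifted date is materialised as a column of the intermediate result, the join no longer depends on where the shift expression is evaluated.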

> Spark returning inconsistent rows and data in a join query when run using 
> Spark SQL (using SQLContext.sql(...))
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-24177
>                 URL: https://issues.apache.org/jira/browse/SPARK-24177
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>         Environment: Production
>            Reporter: Ajay Monga
>            Priority: Major
>
> Spark SQL is returning inconsistent results for a JOIN query: across runs it 
> returns different rows, and the column computed by a simple multiplication 
> takes different values.
> The query is like:
> SELECT
>     second_table.date_value, SUM(intermediate.XXX * second_table.shift_value)
> FROM
> (
>     SELECT
>         date_value, SUM(value) AS XXX
>     FROM first_table
>     WHERE date_value IN ( '2018-01-01', '2018-01-02' )
>     GROUP BY date_value
> ) intermediate
> LEFT OUTER JOIN second_table
>     ON second_table.date_value = (<logic to shift the 'date_value' from the 
>         first table: if it falls on a Saturday or Sunday then use Monday, 
>         else the next valid working date>)
>     AND second_table.date_value IN ( '2018-01-02', '2018-01-03' )
> GROUP BY second_table.date_value
>  
> The suspicion is that execution of the above query is split into two 
> sub-queries, one for first_table and one for second_table, before the join. 
> Each result is then spread across partitions, apparently grouped/distributed 
> by the join column 'date_value'. Because the join condition applies the 
> date-shift logic, the join fails to match in some cases where it should, 
> primarily for date_values at the edges of the partitions across the Spark 
> cluster. The outcome therefore depends on how the data (the RDDs) of the 
> individual sub-queries is partitioned in the first place.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
