shurik mermelshtein created SPARK-47034:
-------------------------------------------

             Summary: join between cached temp tables result in missing entries
                 Key: SPARK-47034
                 URL: https://issues.apache.org/jira/browse/SPARK-47034
             Project: Spark
          Issue Type: Bug
          Components: Examples
    Affects Versions: 3.5.0
            Reporter: shurik mermelshtein


We create several temp tables (views) by loading several Delta tables and joining them. Those views are used to calculate different metrics, and each metric requires a different set of views. Some of the more popular views are cached for better performance.
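
A rough PySpark sketch of this pattern (the table paths and view names here are placeholders, not our real ones):

{code:python}
# Placeholder illustration of the pipeline described above; paths and
# view names are made up for the example.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metrics-pipeline").getOrCreate()

users_raw = spark.read.format("delta").load("s3://bucket/users")        # placeholder path
campaigns = spark.read.format("delta").load("s3://bucket/campaigns")    # placeholder path

# join the Delta tables into a view used by several metric calculations
users_raw.join(campaigns, on="campaign_id", how="inner") \
         .createOrReplaceTempView("users")

# popular views are cached for better performance
spark.catalog.cacheTable("users")
{code}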

We have noticed that after upgrading from Spark 3.4.2 to Spark 3.5.0, some of the joins started to fail.

We can reproduce a case where we have 2 data frames (views). These are not the real names / values we use; they are just for the example (sketched below):
 # users with the columns user_id, campaign_id, user_name.
We make sure it has a single entry:
'111111', '22222', 'John Doe'
 # actions with the columns user_id, campaign_id, action_id, action_count.
We make sure it has a single entry:
'111111', '22222', 'clicks', 5
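
A minimal PySpark sketch of these two views, assuming the example values above (all names are illustrative):

{code:python}
# Minimal, hypothetical construction of the two example views.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SPARK-47034-repro").getOrCreate()

users = spark.createDataFrame(
    [("111111", "22222", "John Doe")],
    ["user_id", "campaign_id", "user_name"],
)
actions = spark.createDataFrame(
    [("111111", "22222", "clicks", 5)],
    ["user_id", "campaign_id", "action_id", "action_count"],
)

users.createOrReplaceTempView("users")
actions.createOrReplaceTempView("actions")

# both views are cached, as in our real pipeline
spark.catalog.cacheTable("users")
spark.catalog.cacheTable("actions")
{code}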

 
 # The users view can be filtered on user_id = '111111' and/or campaign_id = '22222', and the single existing row is found.
 # The actions view can be filtered on user_id = '111111' and/or campaign_id = '22222', and the single existing row is found.
 # users and actions can be inner joined on user_id *OR* campaign_id, and the join succeeds.
 # users and actions can *not* be inner joined on user_id *AND* campaign_id. The join returns no rows (sketched below).
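
Sketched against the repro above (expected: one row from each query; what we observe on 3.5.0 with the views cached is that the last join returns nothing):

{code:python}
# 1 + 2: filtering either view on either key (or both) finds the single row
spark.sql("SELECT * FROM users   WHERE user_id = '111111' AND campaign_id = '22222'").show()
spark.sql("SELECT * FROM actions WHERE user_id = '111111' AND campaign_id = '22222'").show()

# 3: joining on a single key succeeds
spark.sql("""
    SELECT * FROM users u
    JOIN actions a ON u.user_id = a.user_id
""").show()

# 4: joining on both keys returns no rows on 3.5.0 (it worked on 3.4.2)
spark.sql("""
    SELECT * FROM users u
    JOIN actions a
      ON u.user_id = a.user_id AND u.campaign_id = a.campaign_id
""").show()
{code}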


 # If we write both views to S3 and read them back into new data frames, the join suddenly works.
 # If we disable AQE, the join works.
 # Checkpointing the views does not help (workarounds sketched below).
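
The workarounds sketched below use the standard AQE configuration key and a round trip through storage; the S3 paths and the parquet format are placeholders for whatever we actually write:

{code:python}
# Workaround 1: disabling AQE makes the two-key join return the row again.
spark.conf.set("spark.sql.adaptive.enabled", "false")

# Workaround 2: writing the views out and reading them back also fixes the join.
users.write.mode("overwrite").parquet("s3://bucket/tmp/users")        # placeholder path
actions.write.mode("overwrite").parquet("s3://bucket/tmp/actions")    # placeholder path
users2 = spark.read.parquet("s3://bucket/tmp/users")
actions2 = spark.read.parquet("s3://bucket/tmp/actions")
users2.join(actions2, on=["user_id", "campaign_id"], how="inner").show()

# Checkpointing the views did NOT help:
# spark.sparkContext.setCheckpointDir("s3://bucket/tmp/checkpoints")  # placeholder path
# users_ckpt = spark.table("users").checkpoint()
{code}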


