HeartSaVioR opened a new pull request #23634: [SPARK-26187][SQL] Left/right 
outer join should not return outer nulls for already matched rows
URL: https://github.com/apache/spark/pull/23634
 
 
   ## What changes were proposed in this pull request?
   
   This patch fixes the edge case of left/right outer join described below:
   
   - row L1 and row R1 are joined at batch A
   - row R1 is evicted at batch B because of the join condition and the watermark, whereas row L1 is not evicted
   - row L1 is evicted at batch C
   
   When deciding which outer rows to join with null, Spark relies on an assumption documented in a code comment, quoted below:
   
   ```
   Checking whether the current row matches a key in the right side state, and that key
   has any value which satisfies the filter function when joined. If it doesn't,
   we know we can join with null, since there was never (including this batch) a match
   within the watermark period. If it does, there must have been a match at some point, so
   we know we can't join with null.
   ```
   
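   But as the edge case above shows, this assumption does not hold. At batch C the right side state no longer contains R1, so the check concludes that L1 never matched and joins it with null, even though L1 was already joined with R1 at batch A. Since we don't have a cheaper assumption that is free of edge cases, we have to track whether each row has been matched before, and join it with a null row only when it has not been matched.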
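The failure can be reproduced with a toy model. The sketch below is not Spark code: `SideState` and `old_should_emit_null` are hypothetical names, the state is reduced to a dictionary per side, and the filter function from the quoted comment is simplified to "any buffered row under the same key counts as a match".

```python
from dataclasses import dataclass, field


@dataclass
class SideState:
    """Toy per-side state: join key -> list of buffered row values."""
    rows: dict = field(default_factory=dict)

    def add(self, key, value):
        self.rows.setdefault(key, []).append(value)

    def evict(self, key, value):
        self.rows[key].remove(value)
        if not self.rows[key]:
            del self.rows[key]


def old_should_emit_null(left_key, right_state):
    """Simplified old check: join with null iff the right state
    currently holds no row for this key."""
    return left_key not in right_state.rows


left, right = SideState(), SideState()

# Batch A: L1 and R1 arrive with the same key and are joined.
left.add("k", "L1")
right.add("k", "R1")
joined_output = [("L1", "R1")]

# Batch B: R1 is evicted, L1 stays in the left state.
right.evict("k", "R1")

# Batch C: L1 is evicted; the old check only looks at the current right state.
emits_null = old_should_emit_null("k", right)
left.evict("k", "L1")
if emits_null:
    joined_output.append(("L1", None))
```

Under these assumptions `joined_output` ends up containing both `("L1", "R1")` and `("L1", None)`, which is the incorrect outer null for an already matched row.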
   
   To track matches, the patch adds a new state to the streaming join state manager that marks whether each row has been matched with another row. This information is used when evicting rows that are candidates for joining with a null row.
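As a rough illustration of the idea (again a toy model, not the actual state manager change), each buffered row can carry a matched flag that is set when the row is joined and consulted on eviction. `BufferedRow`, `join_and_mark`, and `evict_left` are hypothetical names, and every pair in the bucket is assumed to satisfy the join condition.

```python
from dataclasses import dataclass


@dataclass
class BufferedRow:
    value: str
    matched: bool = False  # set to True once the row has been joined


def join_and_mark(left_rows, right_rows):
    """Join every left/right pair in one key bucket and mark both sides."""
    out = []
    for l in left_rows:
        for r in right_rows:
            out.append((l.value, r.value))
            l.matched = True
            r.matched = True
    return out


def evict_left(row):
    """Emit an outer-null row only if the evicted row was never matched."""
    return [] if row.matched else [(row.value, None)]


l1, r1 = BufferedRow("L1"), BufferedRow("R1")
l2 = BufferedRow("L2")  # arrives later and never finds a partner

# Batch A: L1 joins R1, both get flagged.
output = join_and_mark([l1], [r1])

# Batch B: R1 is evicted; nothing to emit for the left outer join.
right_bucket = []

# Batch C: L1 is evicted. The flag survives R1's eviction, so no null row.
output += join_and_mark([l2], right_bucket)
output += evict_left(l1)
output += evict_left(l2)
```

Here L1 produces no null row because its flag was set at batch A, while L2, which never matched, is still joined with null on eviction.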
   
   This approach avoids dealing with state versions, but end users need to discard the existing state of any query that ran before this patch in order to get correct results for streaming left/right outer joins.
   
   ## How was this patch tested?
   
   Added a unit test that fails on current Spark and passes with this patch. The existing streaming join unit tests also pass.

