[GitHub] [spark] leanken commented on pull request #29304: [SPARK-32494][SQL] Null Aware Anti Join Optimize Support Multi-Column

GitBox Sun, 02 Aug 2020 22:41:03 -0700


leanken commented on pull request #29304:
URL: https://github.com/apache/spark/pull/29304#issuecomment-667814305



   > @agrawaldevesh I finally understand the complexity of multi-column 
support, thanks to your repeated reminders, and I am sorry for being naive. Do 
you think it is still worth carrying on with multi-column support? I sincerely 
ask for your suggestion.
   
   As for how to support it, I think it might be:
   
   1. Scan the buildSide to gather information about which columns contain nulls.
   2. Build a HashedRelation from the original input, including keys that contain any null.
   3. Build an extra HashedRelation that holds all null-padded combinations of the keys.
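
   The three build-side steps above could be sketched roughly as follows. This is only an illustration and not Spark code: plain Python sets stand in for HashedRelation, `None` stands in for SQL null, and the names `null_columns`, `null_padded_keys`, and `build` are hypothetical.

```python
from itertools import combinations

def null_columns(rows):
    """Step 1: indices of key columns that hold at least one None on the build side."""
    return {i for row in rows for i, v in enumerate(row) if v is None}

def null_padded_keys(row):
    """Yield every variant of `row` with some subset of its columns replaced by None."""
    n = len(row)
    for k in range(n + 1):
        for masked in combinations(range(n), k):
            yield tuple(None if i in masked else v for i, v in enumerate(row))

def build(rows):
    nulls = null_columns(rows)   # step 1: which columns contain null
    original = set(rows)         # step 2: original keys, including any-null keys
    # step 3: extra relation holding all null-padded combinations (assumed stand-in)
    extra = {key for row in rows for key in null_padded_keys(row)}
    return nulls, original, extra
```

   Note that in this sketch the extra relation can hold up to 2^n keys per build row for an n-column key, which is one source of the complexity mentioned above.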
   
   When probing with the streamedSide:
   1. If the streamedSide key is all non-null, use the null information gathered 
on the right side to try to find a match in the original HashedRelation. For 
example, for (1,2,3) with buildSide columns c2 and c3 containing null values, 
try to match using the following keys:
   (1,2,3) (1,null,3) (1,2,null) (1,null,null)
   2. If the streamedSide key contains any null column, for example 
(null, 2, 3), use the key to look up in the extra HashedRelation, because it 
contains all possible combinations.
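
   A minimal sketch of the probe logic described above, under the same assumptions as the proposal itself. It is not Spark's implementation: Python sets stand in for the two HashedRelations, `None` stands in for SQL null, column indices are 0-based (so c2 and c3 are indices 1 and 2), and `candidate_keys` and `has_match` are hypothetical names.

```python
from itertools import combinations

def candidate_keys(key, nullable_cols):
    """Variants of an all-non-null key with subsets of the nullable build columns set to None."""
    cols = sorted(nullable_cols)
    for k in range(len(cols) + 1):
        for masked in combinations(cols, k):
            yield tuple(None if i in masked else v for i, v in enumerate(key))

def has_match(key, nullable_cols, original, extra):
    """Probe one streamed key against the two relations (modelled as sets)."""
    if all(v is not None for v in key):
        # Case 1: all non-null, so try each null-substituted variant in the original relation.
        return any(c in original for c in candidate_keys(key, nullable_cols))
    # Case 2: the key contains a null, so look it up directly in the extra relation.
    return key in extra

# The example from the text: key (1,2,3), build columns c2 and c3 contain null.
print(list(candidate_keys((1, 2, 3), {1, 2})))
```

   Since this is an anti join, a streamed row for which `has_match` returns true would be dropped from the output.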

