Re: [I] Proposal: Hook to better support `CollectLeft` joins in distributed execution [datafusion]

via GitHub Fri, 13 Sep 2024 11:51:22 -0700


alamb commented on issue #12454:
URL: https://github.com/apache/datafusion/issues/12454#issuecomment-2349892215


   One challenge I predict with the above scenario is that it seems to assume 
that the order of rows from the build side will be the same on all nodes across 
all partitions (so you can match up the BooleanBuffer across ndoes)
   
   > This sort of goes against the idea that DataFusion itself is not a library 
for distributed query execution, b
   
   I think adding hooks for making distributed engines is a very reasonable 
discussion
   
   > The correct way to do this in a distributed execution is to use a 
partitioned join and repartition data but this is a problem because data is 
huge and the repartition would require shuffling a potentially massive amount 
of data.
   
   I am not sure this is the "correct" way though it is certainly one way to do 
it.
   
   It seems like the core challenge you are describing is finding all rows in a 
small `facts` table that did not match any rows when joined with all rows of an 
arbitrarily distributed `data`  table
   
   You could manage this via a distributed state as you suggest. Another way 
might be to rewrite the query (automatically) to do the check of which rows 
didn't match on a single node
   
   The single node might seem like a bad idea, but if the `facts` is really 
small the cost of rehashing it is probably low
   
   So do something like
   
   ```
   WITH 
     (SELECT facts.fact_value fact_value, data.id did, data.fact_id fact_id
     FROM facts JOIN data -- Note this is now INNER join, done in distributed 
fashion
     ON data.fact_id = fact.id)
   as join_result
   SELECT 
     f1.fv fv, 
     did,  
     f1.fid fid
   FROM facts f1 OUTER JOIN join_result --- this outer query fills in all 
missing fact rows
   ON f1.fact_id = join_result.fact_id
   ```
   
   The idea is that you run the outer query either on a single node or after 
redistributing both sides on fid
   
   This does mean you have to hash `facts` again, but you don't have to move 
the `data` tables around 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Proposal: Hook to better support `CollectLeft` joins in distributed execution [datafusion]

Reply via email to