thinkharderdev commented on issue #12454:
URL: https://github.com/apache/datafusion/issues/12454#issuecomment-2350024323

   > I am not sure this is the "correct" way though it is certainly one way to 
do it.
   
   Yeah, fair. I meant basically that it seems like the way to do it that does 
not require any distributed state shared across distributed execution nodes. 
   
   > So do something like
   
   This was an approach we tried (or something very similar in spirit at least) 
but it breaks down when you have a join + aggregation. E.g. with a query like
   ```
   SELECT data.key, SUM(CASE WHEN facts.fact_value IS NULL THEN 0 ELSE 1) as 
matched, SUM(CASE WHEN facts.fact_value IS NULL THEN 1 ELSE 0) as unmatched
   FROM facts OUTER JOIN data
   ON data.fact_id = fact.id
   GROUP BY data.key
   ```
   
   that tries to calculate the matched vs unmatched rows from `data`.
   
   By adding a coalesce to do the outer join you can no longer fuse the partial 
aggregation with the hash join and you end up back in the position of having to 
shuffle huge amounts of data
   
   FWIW, I implemented my proposal from above on our internal fork of 
DataFusion and it's not terribly intrusive in the DataFusion code (IMO): 
https://github.com/coralogix/arrow-datafusion/pull/269


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to