thinkharderdev commented on issue #12454:
URL: https://github.com/apache/datafusion/issues/12454#issuecomment-2351814850

   > It still not very clear for me what exactly required to be done on DF side
   
   The basic idea (which I have implemented on our internal fork here 
https://github.com/coralogix/arrow-datafusion/pull/269) is to provide a hook in 
DataFusion that can be passed in through the `TaskContext`. The hook allows 
`HashJoinStream` to share state between partitions that may be running on 
different nodes. Specifically, it is sufficient to share the bitmask of 
matched build-side rows; the last partition to execute can then emit the 
unmatched rows. 
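
   To make the shared-bitmask idea concrete, here is a minimal sketch. It is 
not the code from the fork linked above and not an existing DataFusion API: 
the names `SharedJoinState`, `InProcessJoinState`, `report_partition` and 
`unmatched_build_rows` are hypothetical, and the state lives in one process 
behind a `Mutex` purely for illustration. A distributed engine would 
presumably back the same trait with some coordinator service. Each partition 
reports which build-side rows it matched, and only the last partition to 
report gets the merged mask back.

```rust
use std::sync::Mutex;

/// Hypothetical extension point (an assumption, not a DataFusion trait):
/// lets join partitions share the "matched build rows" bitmask.
pub trait SharedJoinState: Send + Sync {
    /// OR this partition's bitmask into the shared one. Returns the merged
    /// mask only to the last partition to report, and `None` to all others.
    fn report_partition(&self, matched: &[bool]) -> Option<Vec<bool>>;
}

/// Illustrative single-process implementation of the hypothetical trait.
pub struct InProcessJoinState {
    inner: Mutex<Inner>,
}

struct Inner {
    merged: Vec<bool>,
    remaining: usize,
}

impl InProcessJoinState {
    pub fn new(build_rows: usize, partitions: usize) -> Self {
        Self {
            inner: Mutex::new(Inner {
                merged: vec![false; build_rows],
                remaining: partitions,
            }),
        }
    }
}

impl SharedJoinState for InProcessJoinState {
    fn report_partition(&self, matched: &[bool]) -> Option<Vec<bool>> {
        let mut inner = self.inner.lock().unwrap();
        assert_eq!(matched.len(), inner.merged.len(), "bitmask length mismatch");
        assert!(inner.remaining > 0, "more reports than partitions");
        // A build row counts as matched if any partition matched it.
        for (merged, seen) in inner.merged.iter_mut().zip(matched) {
            *merged |= *seen;
        }
        inner.remaining -= 1;
        if inner.remaining == 0 {
            Some(inner.merged.clone())
        } else {
            None
        }
    }
}

/// Indices of build-side rows that no partition matched; these are the rows
/// the last partition would emit for an outer join.
pub fn unmatched_build_rows(merged: &[bool]) -> Vec<usize> {
    merged
        .iter()
        .enumerate()
        .filter_map(|(i, &m)| (!m).then_some(i))
        .collect()
}
```

   With two partitions over four build rows, the first call to 
`report_partition` returns `None`, and the second returns the merged mask, 
from which `unmatched_build_rows` yields the rows to emit.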
   
   > Referring to Parallel HJ concepts 
https://www.youtube.com/watch?v=QCTyOLvzR88 some external process needs to 
repartition(it can be hash or by specific columns) the data before sending it 
to the nodes. More on partitioning strategies is in 
https://www.youtube.com/watch?v=S40K8iGa8Ek
   
   I agree that hash partitioning is the standard way to do this in general, 
but in practice it's just not possible to shuffle as much data as hash 
partitioning requires. By far the best-performing way to do it is a broadcast 
join, which is currently not possible.
   
   My personal opinion is that since DataFusion is being used by a number of 
organizations to build distributed query engines (even if DF itself is not 
one), it is reasonable to add hooks/extension points in DF to support that use 
case, as long as:
   1. It doesn't add any measurable overhead to single-node query execution 
(e.g. it should be explicitly opt-in)
   2. It doesn't add significant code complexity and maintenance burden
   
   If it would help to make this conversation concrete with a particular code 
change, I'm happy to submit a PR here to frame things. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

