adriangb commented on issue #7955:
URL: https://github.com/apache/datafusion/issues/7955#issuecomment-2830822721

   > Modify HashJoinExec to build a bloom filter on the build side, and when 
complete call DynamicFilterPhysicalExpr::update
   
   Pretty much: once `HashJoinExec` has completed the build side, it builds a 
bloom filter and wraps it up in a `PhysicalExpr`.
   I don't think you **need** to use a `DynamicFilterPhysicalExpr`, since once 
you build the hash table / bloom filter it's never updated. The reason to have 
`DynamicFilterPhysicalExpr` is e.g. a `TopK` operator, where every 
RecordBatch you process gives you new information for more selective pruning 
(new heap values). So I'd think you could just pass the hardcoded 
`PhysicalExpr` directly?
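
To make the "build once, never update" point concrete, here is a minimal standalone sketch in plain Rust. It does not use any DataFusion types: `BloomFilter`, `build_side_filter`, and `prefilter_probe` are hypothetical names made up for illustration, the keys are assumed to be `i64`, and the sizing (10 bits per key, 7 hash functions) is an arbitrary choice rather than anything DataFusion prescribes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter (hypothetical type, not a DataFusion API).
pub struct BloomFilter {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

/// Derive two base hashes for double hashing; the second is forced odd.
fn hash_pair<T: Hash>(key: &T) -> (u64, u64) {
    let mut first = DefaultHasher::new();
    key.hash(&mut first);
    let a = first.finish();
    let mut second = DefaultHasher::new();
    a.hash(&mut second);
    0x9E37_79B9_7F4A_7C15u64.hash(&mut second);
    (a, second.finish() | 1)
}

/// Bit position of the i-th hash function: (a + i * b) mod num_bits.
fn bit_index(a: u64, b: u64, i: u32, num_bits: u64) -> u64 {
    a.wrapping_add((i as u64).wrapping_mul(b)) % num_bits
}

impl BloomFilter {
    pub fn new(num_bits: u64, num_hashes: u32) -> Self {
        let num_bits = num_bits.max(64);
        let words = ((num_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_bits, num_hashes }
    }

    pub fn insert<T: Hash>(&mut self, key: &T) {
        let (a, b) = hash_pair(key);
        for i in 0..self.num_hashes {
            let bit = bit_index(a, b, i, self.num_bits);
            self.bits[(bit / 64) as usize] |= 1u64 << (bit % 64);
        }
    }

    /// May return false positives, never false negatives.
    pub fn might_contain<T: Hash>(&self, key: &T) -> bool {
        let (a, b) = hash_pair(key);
        (0..self.num_hashes).all(|i| {
            let bit = bit_index(a, b, i, self.num_bits);
            self.bits[(bit / 64) as usize] & (1u64 << (bit % 64)) != 0
        })
    }
}

/// Built exactly once, after the build side has been fully consumed.
pub fn build_side_filter(build_keys: &[i64]) -> BloomFilter {
    let mut filter = BloomFilter::new(build_keys.len() as u64 * 10, 7);
    for key in build_keys {
        filter.insert(key);
    }
    filter
}

/// Drop probe-side keys that definitely have no match on the build side.
pub fn prefilter_probe(filter: &BloomFilter, probe_keys: &[i64]) -> Vec<i64> {
    probe_keys
        .iter()
        .copied()
        .filter(|key| filter.might_contain(key))
        .collect()
}
```

Because the filter is immutable after `build_side_filter` returns, nothing here needs the update mechanism a dynamic filter provides. The probe side only ever reads it.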
   
   To your second point: one thing to consider is if you want this to be 
compatible with any sort of distributed query execution.
   If you do then one of two things would need to happen:
   1. Teach the serialization how to serialize the bloom filter PhysicalExpr -> 
in practice update the protobuf serialization to match against it.
   2. Implement `PhysicalExpr::snapshot()` to convert it into an `InList` or 
something.
   
   Given all of this, my recommendation would be to build a dedicated 
`BloomFilter` `PhysicalExpr` and teach DataFusion how to serialize it to 
ProtoBuf. Then don't use `DynamicFilterPhysicalExpr` or 
`PhysicalExpr::snapshot()`.
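
As a rough illustration of what "serializable" means for such an expression, the sketch below round-trips a bloom filter's state through bytes. It is only a stand-in: the real work would go through DataFusion's protobuf definitions, whereas this uses a hand-rolled little-endian layout. `BloomBits`, `to_bytes`, and `from_bytes` are hypothetical names, not existing APIs.

```rust
use std::convert::TryInto;

/// The state a bloom filter expression would need to ship to another node
/// (hypothetical type; the layout is an assumption for illustration).
#[derive(Debug, PartialEq)]
pub struct BloomBits {
    pub num_hashes: u32,
    pub words: Vec<u64>,
}

impl BloomBits {
    /// Layout: 4 bytes of num_hashes, then each 64-bit word, all little-endian.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(4 + 8 * self.words.len());
        out.extend_from_slice(&self.num_hashes.to_le_bytes());
        for word in &self.words {
            out.extend_from_slice(&word.to_le_bytes());
        }
        out
    }

    /// Returns None if the buffer is too short or not a whole number of words.
    pub fn from_bytes(buf: &[u8]) -> Option<Self> {
        if buf.len() < 4 || (buf.len() - 4) % 8 != 0 {
            return None;
        }
        let num_hashes = u32::from_le_bytes(buf[..4].try_into().ok()?);
        let words = buf[4..]
            .chunks_exact(8)
            .map(|chunk| u64::from_le_bytes(chunk.try_into().unwrap()))
            .collect();
        Some(Self { num_hashes, words })
    }
}
```

The payload is a fixed number of bits regardless of how many keys were inserted, which is one reason shipping the filter itself can be more attractive than snapshotting it into an `InList` of every build-side value.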
   
   The nice thing about this is that the new bloom filter expr could be re-used 
in other places, e.g. `InList` could build a bloom filter instead of a HashSet 
for pre-filtering (I think that's what it currently does).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

