adriangb commented on issue #7955: URL: https://github.com/apache/datafusion/issues/7955#issuecomment-2830822721
> Modify HashJoinExec to build a bloom filter on the build side, and when complete call DynamicFilterPhysicalExpr::update

Pretty much: once `HashJoinExec` has completed the build side it builds a bloom filter and wraps it up in a `PhysicalExpr`. I don't think you **need** to use a `DynamicFilterPhysicalExpr`, since once you build the hash table / bloom filter it's never updated. The reason to have `DynamicFilterPhysicalExpr` is e.g. a `TopK` operator, where every RecordBatch you process gives you new information for more selective pruning (new heap values). So I'd think you could just pass the hardcoded `PhysicalExpr` directly?

To your second point: one thing to consider is whether you want this to be compatible with any sort of distributed query execution. If you do, then one of two things would need to happen:

1. Teach the serialization how to serialize the bloom filter `PhysicalExpr` -> in practice, update the protobuf serialization to match against it.
2. Implement `PhysicalExpr::snapshot()` to convert it into an `InList` or something.

Given all of this, my recommendation would be to build a dedicated `BloomFilter` `PhysicalExpr` and teach DataFusion how to serialize it to ProtoBuf. Then don't use `DynamicFilterPhysicalExpr` or `PhysicalExpr::snapshot()`.

The nice thing about this is that the new bloom filter expr could be re-used in other places, e.g. `InList` could build a bloom filter instead of a HashSet for pre-filtering (I think that's what it currently does).
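To make the "build once, never update, serialize for distributed execution" idea concrete, here is a minimal standalone sketch. Everything in it is an assumption for illustration: the `BloomFilter` struct, the FNV-1a double hashing, and the byte layout are hypothetical and are not DataFusion's API or its protobuf format. A real version would implement `PhysicalExpr` and hook into the existing proto serialization.

```rust
/// Hypothetical standalone Bloom filter (illustration only, not a DataFusion type).
/// It is filled once from the build side and then only read.
#[derive(Debug, Clone, PartialEq)]
pub struct BloomFilter {
    bits: Vec<u64>,  // bit set packed into 64-bit words
    num_bits: u64,   // number of addressable bits
    num_hashes: u32, // probes per key
}

/// FNV-1a: chosen for this sketch because its output is the same in every
/// process, which matters if the filter is shipped to another node.
fn fnv1a(data: &[u8], seed: u64) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325 ^ seed;
    for &b in data {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Double hashing: derive `num_hashes` bit positions from two base hashes.
fn positions(key: &[u8], num_bits: u64, num_hashes: u32) -> impl Iterator<Item = u64> {
    let h1 = fnv1a(key, 0);
    let h2 = fnv1a(key, 0x9e3779b97f4a7c15) | 1; // force odd so the step is never zero
    (0..u64::from(num_hashes)).map(move |i| h1.wrapping_add(i.wrapping_mul(h2)) % num_bits)
}

fn read_u64(buf: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&buf[..8]);
    u64::from_le_bytes(a)
}

impl BloomFilter {
    pub fn new(num_bits: u64, num_hashes: u32) -> Self {
        let num_bits = num_bits.max(1);
        let words = ((num_bits + 63) / 64) as usize;
        Self { bits: vec![0; words], num_bits, num_hashes: num_hashes.max(1) }
    }

    /// Called for every build-side join key.
    pub fn insert(&mut self, key: &[u8]) {
        for p in positions(key, self.num_bits, self.num_hashes) {
            self.bits[(p / 64) as usize] |= 1u64 << (p % 64);
        }
    }

    /// Probe side: `false` means the key is definitely absent,
    /// `true` means it may be present (false positives are possible).
    pub fn contains(&self, key: &[u8]) -> bool {
        positions(key, self.num_bits, self.num_hashes)
            .all(|p| (self.bits[(p / 64) as usize] & (1u64 << (p % 64))) != 0)
    }

    /// Made-up wire layout: num_bits (u64 LE), num_hashes (u32 LE), then the words.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(12 + self.bits.len() * 8);
        out.extend_from_slice(&self.num_bits.to_le_bytes());
        out.extend_from_slice(&self.num_hashes.to_le_bytes());
        for w in &self.bits {
            out.extend_from_slice(&w.to_le_bytes());
        }
        out
    }

    /// Inverse of `to_bytes`; returns `None` on a malformed buffer.
    pub fn from_bytes(buf: &[u8]) -> Option<Self> {
        if buf.len() < 12 {
            return None;
        }
        let num_bits = read_u64(&buf[0..8]);
        let mut h = [0u8; 4];
        h.copy_from_slice(&buf[8..12]);
        let num_hashes = u32::from_le_bytes(h);
        let words = num_bits.checked_add(63)? / 64;
        if num_bits == 0 || num_hashes == 0 || (buf.len() - 12) as u64 != words.checked_mul(8)? {
            return None;
        }
        let bits = buf[12..].chunks_exact(8).map(read_u64).collect();
        Some(Self { bits, num_bits, num_hashes })
    }
}
```

Because the filter is immutable after the build phase, a round trip through `to_bytes` / `from_bytes` yields an equal filter, which is the property a serialized bloom filter expression would rely on when a plan is sent to another node.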