[I] [Proposal] Runtime Filters for DataFusion Comet [datafusion-comet]

via GitHub Wed, 07 Jan 2026 00:24:52 -0800


Shekharrajak opened a new issue, #3053:
URL: https://github.com/apache/datafusion-comet/issues/3053


   ### What is the problem the feature request solves?
   
   Runtime filters are an optimization that can reduce scan I/O by up to 
**90%** and improve query performance for join-heavy workloads. The feature 
filters data at scan time using lightweight data structures constructed during 
hash join build phases.
   
   Runtime filters are lightweight data structures (IN sets, Min/Max bounds, 
Bloom filters) built during hash join **build phases** and pushed down to 
**scan operators** to filter data before reading from storage.
   
   Example:
   
   ```sql
   -- User writes standard SQL
   SELECT o.*, c.name 
   FROM orders o 
   JOIN customers c ON o.customer_id = c.id 
   WHERE c.country = 'USA';
   ```
   
   1. Comet detects the join as a runtime-filter opportunity
   2. Comet builds a filter from the `customers` table during the join build phase
   3. Comet applies the filter to the `orders` scan automatically
   4. The query runs **faster** with **less I/O**
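   
   The steps above can be sketched roughly as follows. This is a minimal, 
   in-memory illustration and not Comet's implementation: `KeyFilter`, `build`, 
   and `prune` are hypothetical names, and join keys are assumed to be `Long` 
   values.
   
   ```scala
   object RuntimeFilterSketch {
     // Hypothetical summary of the join keys seen on the build side.
     final case class KeyFilter(min: Long, max: Long, inSet: Option[Set[Long]]) {
       // True if a probe-side key could still match some build-side key.
       def mightContain(key: Long): Boolean =
         key >= min && key <= max && inSet.forall(_.contains(key))
     }
   
     // Build the filter from build-side keys. An exact IN set is kept only
     // when the number of distinct keys is below the threshold.
     // Assumption: an empty build side yields no filter (conservative).
     def build(buildKeys: Seq[Long], inThreshold: Int): Option[KeyFilter] =
       if (buildKeys.isEmpty) None
       else {
         val distinct = buildKeys.toSet
         val inSet = if (distinct.size < inThreshold) Some(distinct) else None
         Some(KeyFilter(distinct.min, distinct.max, inSet))
       }
   
     // Drop probe-side rows whose key cannot match; no filter keeps every row.
     def prune[T](rows: Seq[T], filter: Option[KeyFilter])(key: T => Long): Seq[T] =
       filter.fold(rows)(f => rows.filter(r => f.mightContain(key(r))))
   }
   ```
   
   For the query above, `build` would receive the `id` values of the customers 
   with `country = 'USA'`, and `prune` would be applied to `orders` rows keyed 
   by `customer_id` before the join runs.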
   
   
   ### Filter Types
   
   1. **IN Filter** (Small cardinality <1000)
   
   2. **Min/Max Filter** (Numeric/date types)
      - Min and max bounds
   
   3. **Bloom Filter** (Large cardinality, future)
      - Probabilistic data structure
   
   The system should automatically select the optimal filter type:
   - Numeric/date → Min/Max Filter (most efficient)
   - Small cardinality → IN Filter
   - Large cardinality → Bloom Filter 
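   
   A minimal sketch of that selection rule, assuming the three rules are 
   checked in the order listed (the proposal does not state a precedence). 
   `FilterKind` and `choose` are hypothetical names, and the default threshold 
   of 1000 comes from the IN filter description above.
   
   ```scala
   object FilterSelectionSketch {
     sealed trait FilterKind
     case object MinMaxFilter extends FilterKind
     case object InFilter extends FilterKind
     case object BloomFilter extends FilterKind
   
     // Assumption: rules are evaluated top to bottom, so numeric/date keys
     // always get a Min/Max filter regardless of cardinality.
     def choose(isNumericOrDate: Boolean,
                distinctKeys: Long,
                inThreshold: Long = 1000L): FilterKind =
       if (isNumericOrDate) MinMaxFilter
       else if (distinctKeys < inThreshold) InFilter
       else BloomFilter
   }
   ```
   
   Under this reading, a string join key with 200 distinct build-side values 
   would get an IN filter, while one with a million distinct values would fall 
   through to the Bloom filter.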
   
   Users should be able to control runtime filters via Spark configuration:
   
   ```scala
   // Enable/disable runtime filters
   spark.conf.set("spark.comet.runtimeFilter.enabled", true)
   
   // Adjust thresholds
   spark.conf.set("spark.comet.runtimeFilter.inFilterThreshold", 1000)
   spark.conf.set("spark.comet.runtimeFilter.bloomFilterFpp", 0.01)
   ```
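   
   To give a sense of what a `bloomFilterFpp` value implies, the sketch below 
   uses the standard Bloom filter sizing formulas (bits = -n ln p / (ln 2)^2, 
   hashes = (bits / n) ln 2). Whether Comet would size its filters this way is 
   an assumption, and `optimalBits` and `optimalHashes` are hypothetical 
   helpers.
   
   ```scala
   object BloomSizingSketch {
     // Number of bits needed for `expectedKeys` keys at false-positive rate `fpp`.
     def optimalBits(expectedKeys: Long, fpp: Double): Long = {
       require(expectedKeys > 0 && fpp > 0.0 && fpp < 1.0)
       val ln2 = math.log(2)
       math.ceil(-expectedKeys * math.log(fpp) / (ln2 * ln2)).toLong
     }
   
     // Number of hash functions that minimizes the false-positive rate
     // for the given bit count (at least one).
     def optimalHashes(expectedKeys: Long, bits: Long): Int =
       math.max(1, math.round(bits.toDouble / expectedKeys * math.log(2)).toInt)
   }
   ```
   
   With `fpp = 0.01`, these formulas give a little under 10 bits per key, so a 
   lower `bloomFilterFpp` trades memory on the build side for fewer false 
   positives at scan time.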
   
   ### Describe the potential solution
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


