kosiew opened a new issue, #19673:
URL: https://github.com/apache/datafusion/issues/19673

   In [`row_filter.rs`](https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/row_filter.rs),
the `required_columns` field in both the `PushdownChecker` and `PushdownColumns`
structs uses `BTreeSet<usize>` to track the column indices required for filter
evaluation.
   
   **Current implementation:**
   ```rust
   struct PushdownChecker<'schema> {
       required_columns: BTreeSet<usize>,
       // ... other fields
   }
   
   struct PushdownColumns {
       required_columns: BTreeSet<usize>,
       nested: NestedColumnSupport,
   }
   ```
   
   ### Motivation
   
   For typical filter predicates, the number of columns referenced is usually 
small (1-5 columns). Using `BTreeSet<usize>` adds overhead:
   - Memory overhead from tree structure
   - Insertion cost is O(log n) vs amortized O(1) for appending to a `Vec`
   - The only operation that benefits from `BTreeSet` is deduplication, but for 
small sets, a simple `Vec` with linear scan would be faster
   - The data is immediately converted to `Vec` after collection anyway (line 
244)
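
The linear-scan alternative mentioned above can be sketched as follows. This is a minimal illustration, not the code in `row_filter.rs`: `insert_unique` and `collect_unique` are hypothetical helper names introduced here. Note that the result is in first-seen order, whereas iterating a `BTreeSet` yields ascending order.

```rust
/// Hypothetical helper (not part of DataFusion): record a column index
/// only if it has not been seen yet. Uses a linear scan, which is cheap
/// when only a handful of columns are referenced.
/// Returns `true` if the index was newly inserted.
fn insert_unique(columns: &mut Vec<usize>, idx: usize) -> bool {
    if columns.contains(&idx) {
        false
    } else {
        columns.push(idx);
        true
    }
}

/// Hypothetical helper: collect the distinct indices in first-seen order.
fn collect_unique(indices: &[usize]) -> Vec<usize> {
    let mut columns = Vec::new();
    for &idx in indices {
        insert_unique(&mut columns, idx);
    }
    columns
}
```

Whether this is actually faster than `BTreeSet` for the sizes seen in practice would need to be confirmed with a benchmark.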
   
   ### Proposed Solution
   
   Replace `BTreeSet<usize>` with `Vec<usize>` and handle deduplication
explicitly if needed.
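
One way the explicit deduplication could look is sketched below, assuming that any caller relying on the ascending order produced by `BTreeSet` iteration still needs that order. `RequiredColumns`, `push`, and `into_sorted_unique` are hypothetical names used only for illustration; they model just the `required_columns` field and nothing else from the real structs.

```rust
/// Hypothetical stand-in for the `required_columns` field discussed in
/// the issue; it is not the actual DataFusion type.
#[derive(Debug, Default)]
struct RequiredColumns {
    indices: Vec<usize>,
}

impl RequiredColumns {
    /// Amortized O(1) append; duplicates are tolerated until the end.
    fn push(&mut self, idx: usize) {
        self.indices.push(idx);
    }

    /// Sort and drop duplicates once, yielding the same ascending,
    /// duplicate-free sequence that iterating a `BTreeSet<usize>` would.
    fn into_sorted_unique(mut self) -> Vec<usize> {
        self.indices.sort_unstable();
        // `dedup` only removes adjacent duplicates, hence the sort first.
        self.indices.dedup();
        self.indices
    }
}
```

Deferring the sort and dedup to a single final step keeps each insertion a plain push, at the cost of temporarily holding duplicate indices.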
   
   
   source: 
https://github.com/apache/datafusion/pull/19545#discussion_r2665094677
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

