sdf-jkl commented on code in PR #9414:
URL: https://github.com/apache/arrow-rs/pull/9414#discussion_r3005452406


##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -430,6 +434,32 @@ impl<T> ArrowReaderBuilder<T> {
             ..self
         }
     }
+
+    /// Set a scatter threshold for filter deferral.
+    ///
+    /// The threshold is the maximum allowed **selector density**
+    /// (`selector_count / row_count`). If applying a predicate would
+    /// produce a density above this value, its result is deferred.
+    /// For example, `0.25` allows at most 25 selectors per 100 rows.
+    ///
+    /// A high selector density means many small skip/read transitions,

Review Comment:
   Yes, but low selector density doesn't guarantee that skip/read transitions 
are large either.



##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -430,6 +434,32 @@ impl<T> ArrowReaderBuilder<T> {
             ..self
         }
     }
+
+    /// Set a scatter threshold for filter deferral.

Review Comment:
   I think it's great, but this heuristic doesn't know anything about the 
distribution of data. Low selector density doesn't guarantee that skip/read 
transitions are large.
   
   For example:
   ```
   total_rows = 10_000
   selector_count = 100
   density = 100 / 10_000 = 0.01   (at threshold)
   
   RowSelection (100 selectors total):
   1)  select 4951 rows
   2)  skip 1
   3)  select 1
   4)  skip 1
   ...
   99) skip 1
   100) select 4951 rows
   ```
   Which wouldn't be good:
   ```
   [ SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS ][XSXSXSXSXS...XS][ 
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS ]
     ~4951 selected                                     98 rows            
~4951 selected
   ```
   One way to catch this is by checking selectivity. `49 / 10_000 = 0.0049` 
(Very low)
   
   ----
   
   I propose keeping both `selector_density` and `selectivity_threshold` you 
had before. 
   
   This matrix shows the possible cases:
   | Density (`selectors/rows`) | Selectivity (`skipped/rows`) | Typical 
Pattern | Pushdown / Deferral Hint |
   |---|---|---|---|
   | Low | Low | Small contiguous skipped island | Would think is good without 
checking the selectivity |
   | Low | High | Large contiguous skips | Great!|
   | High | Low | Many tiny transitions, little skipped overall | Worst case |
   | High | High | Many tiny transitions but lots skipped | Check if it 
increases the `selector_count`  |
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to