sdf-jkl commented on code in PR #9414:
URL: https://github.com/apache/arrow-rs/pull/9414#discussion_r3005452406
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -430,6 +434,32 @@ impl<T> ArrowReaderBuilder<T> {
..self
}
}
+
+ /// Set a scatter threshold for filter deferral.
+ ///
+ /// The threshold is the maximum allowed **selector density**
+ /// (`selector_count / row_count`). If applying a predicate would
+ /// produce a density above this value, its result is deferred.
+ /// For example, `0.25` allows at most 25 selectors per 100 rows.
+ ///
+ /// A high selector density means many small skip/read transitions,
Review Comment:
Yes, but low selector density doesn't guarantee that skip/read transitions
are large either.
##########
parquet/src/arrow/arrow_reader/mod.rs:
##########
@@ -430,6 +434,32 @@ impl<T> ArrowReaderBuilder<T> {
..self
}
}
+
+ /// Set a scatter threshold for filter deferral.
Review Comment:
I think it's great, but this heuristic doesn't know anything about the
distribution of data. Low selector density doesn't guarantee that skip/read
transitions are large.
For example:
```
total_rows = 10_000
selector_count = 100
density = 100 / 10_000 = 0.01 (at threshold)
RowSelection (100 selectors total):
1) select 4951 rows
2) skip 1
3) select 1
4) skip 1
...
99) skip 1
100) select 4951 rows
```
Which wouldn't be good:
```
[ SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS ][XSXSXSXSXS...XS][
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS ]
~4951 selected 98 rows
~4951 selected
```
One way to catch this is by checking selectivity. `49 / 10_000 = 0.0049`
(Very low)
----
I propose keeping both `selector_density` and `selectivity_threshold` you
had before.
This matrix shows the possible cases:
| Density (`selectors/rows`) | Selectivity (`skipped/rows`) | Typical
Pattern | Pushdown / Deferral Hint |
|---|---|---|---|
| Low | Low | Small contiguous skipped island | Would think is good without
checking the selectivity |
| Low | High | Large contiguous skips | Great!|
| High | Low | Many tiny transitions, little skipped overall | Worst case |
| High | High | Many tiny transitions but lots skipped | Check if it
increases the `selector_count` |
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]