mbutrovich commented on code in PR #22026:
URL: https://github.com/apache/datafusion/pull/22026#discussion_r3221745541
##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -1323,6 +1464,65 @@ impl PushDecoderStreamState {
type ConstantColumns = HashMap<String, ScalarValue>;
+/// Return `base` unchanged when `extra` is empty; otherwise build a new schema
+/// with `extra` appended to `base`'s fields.
+fn append_fields(base: &SchemaRef, extra: &[FieldRef]) -> SchemaRef {
+ if extra.is_empty() {
+ return Arc::clone(base);
+ }
+ let fields = base
+ .fields()
+ .iter()
+ .cloned()
+ .chain(extra.iter().cloned())
+ .collect::<Vec<_>>();
+ Arc::new(Schema::new(fields))
+}
+
+/// Reject predicates that reference a virtual column when filter pushdown is
+/// enabled.
+///
+/// arrow-rs's `RowFilter` evaluates predicates against a `ProjectionMask` that
+/// addresses parquet leaves only; virtual columns (e.g. `row_number`) are
+/// synthesized by the reader *after* filter evaluation and cannot be
referenced
+/// inside a row filter. Silently dropping such a predicate would produce wrong
+/// results, so we fail loudly here.
+///
+/// `ParquetSource::try_pushdown_filters` already classifies virtual-column
+/// filters as `PushedDown::No` so the `FilterPushdown` optimizer leaves them
+/// above the scan; this check is defense-in-depth for callers that build plans
+/// manually and set `with_pushdown_filters(true)` alongside a predicate
+/// referencing virtual columns.
+fn validate_predicate_does_not_reference_virtual_columns(
Review Comment:
Thanks @adriangb, want to make sure I understand the intent before I
refactor, since the two consumer shapes we have land in different places:
- DataFusion front-end path: planner calls `try_pushdown_filters`, which
already reports virtual-col filters as `PushedDown::No` and excludes them from
`source.predicate`. The `FilterExec` stays above the scan, so the opener never
sees a virtual-col ref in the first place.
- Direct-opener consumers (Comet, and anyone else bypassing the optimizer):
they construct the predicate and set `with_pushdown_filters(true)` themselves.
No `try_pushdown_filters` call, no guarantee the predicate was split, no
`FilterExec` above unless they built one.
The opener-level check only matters for the second group. If we silently
split and drop virtual-col conjuncts, those callers get wrong results with no
signal; the current error tells them the contract and how to fix it.
Is your suggestion:
- (a) Drop the opener check entirely and rely on the `try_pushdown_filters`
boundary, accepting that direct-opener callers are on their own?
- (b) Keep the check but split the predicate in the opener so non-virtual
conjuncts still participate in row-filtering / pruning, rather than erroring on
the whole predicate?
- Something else?
Happy to go either way, just want to make sure we're not setting a footgun
for the Comet shape.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]