adriangb commented on code in PR #22026:
URL: https://github.com/apache/datafusion/pull/22026#discussion_r3221800005


##########
datafusion/datasource-parquet/src/opener.rs:
##########
@@ -1323,6 +1464,65 @@ impl PushDecoderStreamState {
 
 type ConstantColumns = HashMap<String, ScalarValue>;
 
+/// Return `base` unchanged when `extra` is empty; otherwise build a new schema
+/// with `extra` appended to `base`'s fields.
+fn append_fields(base: &SchemaRef, extra: &[FieldRef]) -> SchemaRef {
+    if extra.is_empty() {
+        return Arc::clone(base);
+    }
+    let fields = base
+        .fields()
+        .iter()
+        .cloned()
+        .chain(extra.iter().cloned())
+        .collect::<Vec<_>>();
+    Arc::new(Schema::new(fields))
+}
+
+/// Reject predicates that reference a virtual column when filter pushdown is
+/// enabled.
+///
+/// arrow-rs's `RowFilter` evaluates predicates against a `ProjectionMask` that
+/// addresses parquet leaves only; virtual columns (e.g. `row_number`) are
+/// synthesized by the reader *after* filter evaluation and cannot be 
referenced
+/// inside a row filter. Silently dropping such a predicate would produce wrong
+/// results, so we fail loudly here.
+///
+/// `ParquetSource::try_pushdown_filters` already classifies virtual-column
+/// filters as `PushedDown::No` so the `FilterPushdown` optimizer leaves them
+/// above the scan; this check is defense-in-depth for callers that build plans
+/// manually and set `with_pushdown_filters(true)` alongside a predicate
+/// referencing virtual columns.
+fn validate_predicate_does_not_reference_virtual_columns(

Review Comment:
   I see. I didn't realize Comet sets the filters directly without going 
through `try_pushdown_filters`. My suggestion (which may not be possible) would 
be that Comet and other similar consumers follow the same contract the rest of 
DataFusion uses and go through `try_pushdown_filters`. If needed we could 
extract some bit of the optimizer rule into public functions to avoid you 
having to re-implement the logic of placing a `FilterExec` above for rejected 
filters, etc, but I also think that should not be that complicated.
   
   Does that make sense or are there reasons why that would not work?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to