adriangb commented on PR #15057:
URL: https://github.com/apache/datafusion/pull/15057#issuecomment-2800002196
I would like to resume this work.
Some thoughts: should the rewrite happen via a new trait, as I'm currently
doing, or should we add a method `PhysicalExpr::with_schema`?
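To make the question concrete, here's a rough sketch of what a `with_schema`-style method could look like. The trait name, method name, and signature are all my assumptions for illustration, not anything that exists in DataFusion today:

```rust
// Hypothetical sketch only: illustrates the shape of the API being discussed,
// whether it ends up as a separate trait or a method on `PhysicalExpr` itself.
use std::sync::Arc;

use arrow::datatypes::Schema;
use datafusion_common::Result;
use datafusion_physical_expr::PhysicalExpr;

pub trait WithSchema {
    /// Rewrite `self` so it evaluates correctly against `file_schema`,
    /// e.g. by casting embedded literals or remapping column indexes,
    /// returning an error if the expression cannot adapt.
    fn with_schema(
        self: Arc<Self>,
        file_schema: &Schema,
    ) -> Result<Arc<dyn PhysicalExpr>>;
}
```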
If we add `with_schema` what schema do we pass it? The actual file schema?
There's something to be said for that: it could rewrite filters to cast the
literals instead of casting the columns/arrays [as is currently
done](https://github.com/pydantic/datafusion/blob/0b01fdf7f02f9097c319204058576f420b9790b4/datafusion/datasource-parquet/src/row_filter.rs#L146),
which should be cheaper. I expect that any time it was okay to cast the data
it was also okay to cast the predicate itself. It could also absorb the work of
[reassign_predicate_columns](https://github.com/pydantic/datafusion/blob/0b01fdf7f02f9097c319204058576f420b9790b4/datafusion/datasource-parquet/src/row_filter.rs#L123)
(we implement it for `Column` such that if its index doesn't match but
another column's does, it swaps).
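For example, a minimal sketch of the literal-casting idea, assuming a simple `col <op> lit` predicate whose column index already matches the file schema (the function name and structure are illustrative, not what this PR implements):

```rust
// Hedged sketch of "cast the literal, not the column": if the file stores the
// column as Int32 but the table schema (and therefore the literal) is Int64,
// rewrite `a = 5i64` into `a = 5i32` once at plan time instead of casting
// every batch of `a` at scan time.
use std::sync::Arc;

use arrow::datatypes::{DataType, Schema};
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::Operator;
use datafusion_physical_expr::expressions::{BinaryExpr, Column, Literal};
use datafusion_physical_expr::PhysicalExpr;

/// Rewrite `col <op> lit` so the literal is cast to the column's type in the
/// file schema, assuming the cast is valid for the values involved.
fn cast_literal_to_file_type(
    col: &Column,
    op: Operator,
    lit: &Literal,
    file_schema: &Schema,
) -> Result<Arc<dyn PhysicalExpr>> {
    let file_type: &DataType = file_schema.field(col.index()).data_type();
    let casted: ScalarValue = lit.value().cast_to(file_type)?;
    Ok(Arc::new(BinaryExpr::new(
        Arc::new(col.clone()),
        op,
        Arc::new(Literal::new(casted)),
    )))
}
```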
I suspect the hard bit with this approach will be edge cases: what if a
filter _cannot_ adapt itself to the file schema, but we could cast the column
to make it work? I'm thinking of something like a UDF that only accepts `Utf8`
but the file produces `Utf8View` 🤔
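One possible way out of that edge case (again a sketch under assumptions, and just the existing per-batch cast approach applied selectively) would be to fall back to wrapping the column in a cast only when the expression can't rewrite itself:

```rust
// Hedged sketch: if an expression reports it cannot adapt to the file's type
// (e.g. a UDF that only accepts Utf8 while the file yields Utf8View), wrap the
// column in a cast back to the expected type, paying the per-batch cast cost
// only in that case.
use std::sync::Arc;

use arrow::datatypes::DataType;
use datafusion_physical_expr::expressions::CastExpr;
use datafusion_physical_expr::PhysicalExpr;

fn cast_column_as_fallback(
    column_expr: Arc<dyn PhysicalExpr>,
    expected_type: DataType,
) -> Arc<dyn PhysicalExpr> {
    // `None` uses the default cast options.
    Arc::new(CastExpr::new(column_expr, expected_type, None))
}
```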
I think @jayzhan-synnada proposed something similar in
https://github.com/apache/datafusion/pull/15685/files#diff-2b3f5563d9441d3303b57e58e804ab07a10d198973eed20e7751b5a20b955e42.
@alamb any thoughts?