Re: [PR] Per file filter evaluation [datafusion]

via GitHub Sun, 13 Apr 2025 20:19:56 -0700


jayzhan211 commented on PR #15057:
URL: https://github.com/apache/datafusion/pull/15057#issuecomment-2800373423


   > PhysicalExpr::with_schema
   
   This method is too general: it is unclear what each PhysicalExpr should do 
with the provided schema, so I don't think it is a good idea.
   
   > I suspect the hard bit with this approach will be edge cases: what if a 
filter cannot adapt itself to the file schema, but we could cast the column to 
make it work? I'm thinking something like a UDF that only accepts Utf8 but the 
file produces Utf8View
   
   I think it is unavoidable that we need to cast the columns to be able to 
evaluate the filter.
   
   Another question: isn't the filter created based on the table schema? The 
batch is then read with the file schema, cast to the table schema, and 
evaluated by the filter. What we could do instead is rewrite the filter based 
on the file schema. Assume we have `cast(a, i64) = 100`, where `a` is i32 in 
the table schema and i64 in the file schema. We rewrite it to 
`cast(cast(a, i32), i64) = 100` and then optimize it to `a = 100`. In your 
example, where the UDF only accepts Utf8, we know there is no optimization we 
can apply, so we just end up with an additional cast from the file schema to 
the table schema.
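
   A minimal sketch of that rewrite, to make the two steps concrete. It uses a 
toy expression type rather than DataFusion's actual `PhysicalExpr` API, so 
`Expr`, `DataType`, `Schema`, `adapt_to_file_schema` and `simplify` are all 
hypothetical names invented for illustration. It also assumes that dropping the 
intermediate narrowing cast is acceptable, which ignores overflow for file 
values that do not fit in i32.

```rust
use std::collections::HashMap;

/// Toy data types; only what the example needs (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Int32,
    Int64,
}

/// Toy expression tree (hypothetical, not DataFusion's PhysicalExpr).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Cast(Box<Expr>, DataType),
    Eq(Box<Expr>, Box<Expr>),
}

/// Column name -> type. A real schema carries more than this.
type Schema = HashMap<String, DataType>;

/// Step 1: wrap every column whose file type differs from its table type
/// in a cast to the table type, so the rest of the filter still sees the
/// types it was planned against.
fn adapt_to_file_schema(expr: Expr, table: &Schema, file: &Schema) -> Expr {
    match expr {
        Expr::Column(name) => {
            let table_ty = table.get(&name).copied();
            let file_ty = file.get(&name).copied();
            match (table_ty, file_ty) {
                (Some(t), Some(f)) if t != f => {
                    Expr::Cast(Box::new(Expr::Column(name)), t)
                }
                _ => Expr::Column(name),
            }
        }
        Expr::Cast(inner, ty) => {
            Expr::Cast(Box::new(adapt_to_file_schema(*inner, table, file)), ty)
        }
        Expr::Eq(l, r) => Expr::Eq(
            Box::new(adapt_to_file_schema(*l, table, file)),
            Box::new(adapt_to_file_schema(*r, table, file)),
        ),
        lit @ Expr::Literal(_) => lit,
    }
}

/// Step 2: fold `cast(cast(col, _), T)` into `col` when the file already
/// stores `col` as `T`. Assumption: the intermediate cast is treated as
/// lossless, which is not true in general for narrowing casts.
fn simplify(expr: Expr, file: &Schema) -> Expr {
    match expr {
        Expr::Cast(inner, outer_ty) => {
            let inner = simplify(*inner, file);
            if let Expr::Cast(innermost, _) = &inner {
                if let Expr::Column(name) = innermost.as_ref() {
                    if file.get(name) == Some(&outer_ty) {
                        return Expr::Column(name.clone());
                    }
                }
            }
            Expr::Cast(Box::new(inner), outer_ty)
        }
        Expr::Eq(l, r) => Expr::Eq(
            Box::new(simplify(*l, file)),
            Box::new(simplify(*r, file)),
        ),
        other => other,
    }
}
```

   With `a` as i32 in the table schema and i64 in the file schema, step 1 turns 
`cast(a, i64) = 100` into `cast(cast(a, i32), i64) = 100`, and step 2 reduces 
that to `a = 100`. When no fold applies, as in the Utf8-only UDF case, the 
extra cast from step 1 simply stays in the filter.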
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


