adriangb commented on issue #15780:
URL: https://github.com/apache/datafusion/issues/15780#issuecomment-2902428819

   > > "casting should not be changed after planning"
   > 
   > If I said exactly that, I stand corrected. _Coercion_ is something 
that should be applied during the analysis/initial planning phase. Coercion rules 
result in casts being inserted into the plan. After the initial plan is fully 
formed, the word "coercion" does not exist anymore.
   > 
   > Casts are in the same category as function calls: the optimizer may 
reorganize or replace function calls with other expressions as long as they are 
_equivalent_ (and are believed to "be better"). Casts can be removed or 
replaced the same way (again: as long as the resulting expression is well 
formed and equivalent).
   
   Thanks for correcting me! That's the sort of distinction I knew you'd be 
able to make that I was lacking. It's a helpful way to think about it.
   
   > from the issue description:
   > 
   > > So when the filter gets into ParquetSource it's an Int32 filter. But 
when we read the file schema it's actually an Int8!
   > 
   > Where does Int8 come back?
   > Anyway, as the example shows, two different files may have two different 
internal representations for the same SQL-level column. I.e. the table may 
declare Int64, but the file may contain Int32 or Int16. (This is not limited to 
different integer bit widths.) The Parquet source, which deals with individual 
files, may perform logic similar to the `unwrap_cast` optimizer. Does it matter though?
   
   That's the point: we need to do logic similar to `unwrap_cast`, which is 
non-trivial, I think.
   I started down that road, got confused about some cases, and backed out.
   For example, if you have a table schema of `col1: Int64, col2: Int64` and a 
predicate `col1 = col2`, there will be no casts at the logical level. But when 
you get to the file level and the file schema is `col1: Int8, col2: UInt32`, 
you have to do something more like coercion, I think (i.e. introduce some 
casts). What would you suggest we do in this case?
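   
   To make the two cases concrete, here is a minimal, self-contained sketch of 
one possible per-file rewrite. It is a toy model, not DataFusion code: `Ty`, 
`Expr`, `fits`, and `rewrite_for_file` are hypothetical names invented for 
illustration, and the real `unwrap_cast` logic handles many more types and 
operators. The sketch assumes the simplest option for `col1 = col2`, casting 
each file column back to the table type, which is only one of the choices that 
would need to be agreed on.

```rust
use std::collections::HashMap;

/// Toy stand-in for Arrow data types (hypothetical, not `arrow::datatypes::DataType`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ty { Int8, Int16, Int32, Int64, UInt32 }

/// Toy expression tree (hypothetical, not DataFusion's `Expr` or `PhysicalExpr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64, Ty),
    Cast(Box<Expr>, Ty),
    Eq(Box<Expr>, Box<Expr>),
}

/// Returns true if `v` is representable in `ty`.
fn fits(v: i64, ty: Ty) -> bool {
    let (lo, hi) = match ty {
        Ty::Int8 => (i8::MIN as i64, i8::MAX as i64),
        Ty::Int16 => (i16::MIN as i64, i16::MAX as i64),
        Ty::Int32 => (i32::MIN as i64, i32::MAX as i64),
        Ty::Int64 => (i64::MIN, i64::MAX),
        Ty::UInt32 => (0, u32::MAX as i64),
    };
    lo <= v && v <= hi
}

/// Rewrites a table-level predicate so it can run against one file's schema.
fn rewrite_for_file(
    expr: &Expr,
    table: &HashMap<String, Ty>,
    file: &HashMap<String, Ty>,
) -> Expr {
    match expr {
        Expr::Eq(l, r) => {
            // `column = literal`: in the spirit of unwrap_cast, retype the
            // literal to the file's column type when the value fits.
            if let (Expr::Column(name), Expr::Literal(v, _)) = (&**l, &**r) {
                if let Some(&file_ty) = file.get(name) {
                    if fits(*v, file_ty) {
                        return Expr::Eq(
                            Box::new(Expr::Column(name.clone())),
                            Box::new(Expr::Literal(*v, file_ty)),
                        );
                    }
                }
            }
            // Otherwise rewrite both sides independently.
            Expr::Eq(
                Box::new(rewrite_for_file(l, table, file)),
                Box::new(rewrite_for_file(r, table, file)),
            )
        }
        // A column whose file type differs from the table type is cast back
        // to the table type (assumed policy; other choices are possible).
        Expr::Column(name) => match (table.get(name), file.get(name)) {
            (Some(t), Some(f)) if t != f => Expr::Cast(Box::new(expr.clone()), *t),
            _ => expr.clone(),
        },
        Expr::Cast(inner, ty) => {
            Expr::Cast(Box::new(rewrite_for_file(inner, table, file)), *ty)
        }
        Expr::Literal(..) => expr.clone(),
    }
}
```

   With the table schema `col1: Int64, col2: Int64` and the file schema 
`col1: Int8, col2: UInt32`, this sketch turns `col1 = col2` into 
`CAST(col1 AS Int64) = CAST(col2 AS Int64)`, while `col1 = 5` becomes a 
comparison against an Int8 literal with no cast on the column.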
   
   Basically I think we need to all agree that this complexity is the right way 
to go and then agree on what to do in the different scenarios.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
