jorgecarleitao commented on pull request #7967: URL: https://github.com/apache/arrow/pull/7967#issuecomment-678739758
This is currently failing for an interesting reason, and I need some help in decision making. TL;DR options: 1. wait for #8024 and stop using the type coercer, since it breaks schema invariance under optimization 2. short-term copy-paste some of the type coercion on the optimizer to the SQL planner I am more inclined to wait for the typing to be debt free before merging this, but I am willing to hack something up here to make have this in, if there is any sense of urgency on it. ------- To recap, the main goal of this PR is to allow scalar UDFs to accept multiple input types during planning (still returning the same output type). This generally speeds up calculations as smaller types are faster than larger types. E.g. numerically, `sqrt(float32) -> float64` is faster than `sqrt(CAST(float32 AS float64)) -> float64`. Due to how the `get_supertype` is implemented and how the sql planner uses it, the statement ```SELECT sqrt(c11) FROM ...``` is (in master) converted to a logical plan already with a coercion: the column is named `"sqrt(CAST(c11 AS float64))"` when `c11` is not a float64. This PR removed the coercion from the sql planner as I was trying to not have to copy-paste code the new logic from the type coercer into the SQL planner. However, because our type coercer currently breaks schema invariance under optimization, this now fails the test of that invariant: the plan's schema has a field named `"sqrt(c11)"`, but the type coercer changes it to `"sqrt(CAST(c11 AS float64))"`. Note that this is independent of the changes that this PR proposes - even with a single allowed datatype, we still need to perform numerical coercion. What changed here was that the type coercion rules for multiple data types are a bit more complex, and I was trying to not have to copy-paste code from the type coercer into the SQL planner (in other words, there is some coupling between the two via the `get_supertype`). IMO the root cause is the type coercer optimizer, that breaks schema invariance because it operates on the logical plane. One option is to stop using the type coercer optimizer at the logical level, and instead apply cast operations at the physical level, as #8024 moves towards that. Another option is to copy-paste [the type coercion that this PR proposes on type coercer] to the SQL planner, thus guaranteeing that the coercion is consistent between the planner and the type coercer. @andygrove and @alamb , what do you think? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
