paleolimbot opened a new issue, #22079: URL: https://github.com/apache/datafusion/issues/22079
### Describe the bug

After https://github.com/apache/datafusion/pull/21322 and https://github.com/apache/datafusion/pull/21390, the `Cast` "to" field that was added in https://github.com/apache/datafusion/pull/18136 to represent a cast to an extension type (by including field metadata on the cast target) does not have its field metadata consistently represented on the output of `expr.to_field()`. Instead, the metadata propagates from the input, which often does not make sense (particularly in the intended usage of the cast `field`, where the metadata is meaningful).

The main pieces of code that affect the ability of `Expr::Cast` to represent a cast to an extension type are where the expression field is resolved in the logical expression:

https://github.com/apache/datafusion/blob/53517af674e2540805dab0f688f324537ada3220/datafusion/expr/src/expr_schema.rs#L74-L88

...and where it is resolved in a physical expression:

https://github.com/apache/datafusion/blob/53517af674e2540805dab0f688f324537ada3220/datafusion/physical-expr/src/expressions/cast.rs#L153-L166

An example of where this results in output that is nonsensical or may be rejected by consumers is a cast from a UUID extension type to Utf8, which will now happily execute but output the `arrow.uuid` extension name on the Utf8 storage, which is not allowed. Conversely, a cast from Utf8 to UUID *may* execute for strings that are 16 characters long (maybe; I forget the casting rules between string and binary), but the result definitely would not be a UUID (before the two noted changes, this would have given a clear error).

There are probably some casts that make sense to execute in this way (e.g., a cast where the data type is identical to the input, while usually simplified away, could in theory propagate extension type metadata safely). I am not sure whether it makes sense for other metadata to be propagated in this way (the usage of non-extension metadata is highly variable).
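
As a rough illustration of the UUID-to-Utf8 case, here is a minimal Python sketch (standard library only; it is not DataFusion or pyarrow code) of the kind of check a consumer might perform. The `ARROW:extension:name` metadata key and the `FixedSizeBinary(16)` storage for `arrow.uuid` follow the Arrow canonical extension type definitions; the `Field` class, the storage type strings, and `check_extension_storage` are hypothetical names used only for this example.

```python
from dataclasses import dataclass, field

# Metadata key Arrow uses to record an extension type name on a field.
EXTENSION_NAME_KEY = "ARROW:extension:name"

# Storage required by each extension type. Only arrow.uuid is listed here;
# the storage type strings are an illustrative encoding, not a real API.
REQUIRED_STORAGE = {"arrow.uuid": "FixedSizeBinary(16)"}


@dataclass
class Field:
    """Hypothetical stand-in for an Arrow field: a storage type plus metadata."""

    name: str
    storage_type: str
    metadata: dict = field(default_factory=dict)


def check_extension_storage(f: Field) -> None:
    """Raise ValueError if the extension name does not match the storage type."""
    ext_name = f.metadata.get(EXTENSION_NAME_KEY)
    required = REQUIRED_STORAGE.get(ext_name)
    if required is not None and f.storage_type != required:
        raise ValueError(
            f"{ext_name} requires storage {required}, got {f.storage_type}"
        )


uuid_col = Field("id", "FixedSizeBinary(16)", {EXTENSION_NAME_KEY: "arrow.uuid"})

# Cast output when the input metadata is propagated: Utf8 storage that
# still claims to be arrow.uuid.
propagated = Field("id", "Utf8", dict(uuid_col.metadata))

# Cast output when the cast strips metadata.
stripped = Field("id", "Utf8", {})
```

Under these assumptions, `check_extension_storage(uuid_col)` and `check_extension_storage(stripped)` pass, while `check_extension_storage(propagated)` raises `ValueError`, mirroring the kind of rejection a consumer may produce.
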
I personally think it is safer to strip metadata in a cast (for the same reason that we strip metadata when executing a scalar function).

### To Reproduce

Cast a UUID column to a string and look at the field metadata, or try to consume the result with pyarrow.

### Expected behavior

I would have expected a cast to UUID to error (because the result is not a UUID).

### Additional context

In https://github.com/apache/datafusion/pull/21071 I'm close to a mechanism to actually execute casts to and from extension types, where the output field of a cast to UUID really is a UUID. I can also isolate a smaller change from that PR to fix this behaviour.

cc @adriangb since we discussed this behaviour in one of the above PRs and in the extension types work.
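
One reading of the "strip metadata in a cast" suggestion can be sketched in a few lines of Python (standard library only). This is an illustration of the policy, not the change made in any of the linked PRs: `cast_output_metadata` is a hypothetical helper, and field metadata is modelled as a plain dict.

```python
# Metadata key Arrow uses to record an extension type name on a field.
EXTENSION_NAME_KEY = "ARROW:extension:name"


def cast_output_metadata(input_metadata: dict, target_metadata: dict) -> dict:
    """Hypothetical policy: keep only the metadata declared on the cast target.

    `input_metadata` is accepted to make the contrast explicit, but it is
    deliberately ignored, so nothing propagates from the input of the cast.
    """
    return dict(target_metadata)


uuid_metadata = {EXTENSION_NAME_KEY: "arrow.uuid"}

# UUID -> Utf8: the target declares no metadata, so the output has none.
uuid_to_string = cast_output_metadata(uuid_metadata, {})

# Utf8 -> UUID: the output is labelled as a UUID only because the target
# field says so. Whether such a cast should execute is a separate question.
string_to_uuid = cast_output_metadata({}, uuid_metadata)
```

With this policy the Utf8 output of a UUID-to-string cast carries no extension name, and an extension name appears on the output only when the cast target explicitly requests it.
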
