paleolimbot opened a new issue, #22079:
URL: https://github.com/apache/datafusion/issues/22079

   ### Describe the bug
   
   After https://github.com/apache/datafusion/pull/21322 and 
https://github.com/apache/datafusion/pull/21390 the `Cast` "to" field that was 
added in https://github.com/apache/datafusion/pull/18136 to be able to 
represent a cast to an extension type (by including field metadata on the cast 
target) doesn't have its field metadata consistently represented on the output 
`expr.to_field()`. Instead, the metadata is propagating from the input which 
often does not make sense (particularly in the intended usage of the cast 
`field` where the metadata is meaningful)
   
   The main pieces of code that affect the ability of the `Expr::Cast` to 
represent a cast to an extension type are where the expression field is 
resolved in the logical expression:
   
   
https://github.com/apache/datafusion/blob/53517af674e2540805dab0f688f324537ada3220/datafusion/expr/src/expr_schema.rs#L74-L88
   
   ...and where it is resolved in a physical expression:
   
   
https://github.com/apache/datafusion/blob/53517af674e2540805dab0f688f324537ada3220/datafusion/physical-expr/src/expressions/cast.rs#L153-L166
   
   An example of where this results in output that is non-sensical/may be 
rejected by consumers is a cast from a UUID extension type to Utf8 (which will 
now happily execute but output the arrow.uuid extension name on the Utf8 
storage, which is not allowed). Conversely, a cast from Utf8 to UUID *may* work 
for strings that are 16 characters long (maybe, I forget the casting rules 
between string and binary), but definitely would not cast to a UUID (before the 
two noted changes this would have given a clear error.
   
   There are probably some casts that make sense to execute in this way (e.g., 
a cast where the data type is identical to the input, while usually simplified 
away, could in theory propagate extension type metadata safely). I am not sure 
whether it makes sense for other metadata to be propagated in this way (the 
usage of non-extension metadata is highly variable). I personally think it is 
safer to strip metadata in a cast (for the same reason that we strip metadata 
when executing a scalar function).
   
   ### To Reproduce
   
   Cast a UUID column to a string and look at the field metadata / try to 
consume with pyarrow.
   
   ### Expected behavior
   
   I would have expected a cast to UUID to error (because the result is not a 
UUID).
   
   ### Additional context
   
   In https://github.com/apache/datafusion/pull/21071 I'm close to a mechanism 
to actually execute casts to and from extension types, where the output field 
of a cast to UUID really is a UUID. I can also isolate a smaller change from 
that PR to fix this behaviour.
   
   cc @adriangb since we discussed this behaviour in one of the above PRs and 
in the extension types work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to