paleolimbot commented on issue #22079:
URL: https://github.com/apache/datafusion/issues/22079#issuecomment-4431486725

Possibly the core issue is that we don't separate types and metadata (I wish we had/this was possible in arrow-rs!) so we had to put a FieldRef everywhere. Everywhere in DataFusion that I know about, that FieldRef-that-maybe-should-actually-have-been-a-data-type-of-some-kind's metadata is propagated *except* for the cast. How about:

- `Expr::to_field()` reflects the `Field` metadata specified in the `Cast`
- Whoever creates the `Expr::Cast` can propagate metadata when creating the cast if they feel it's safe to do so
- The `CastExtension` I have prototyped in https://github.com/apache/datafusion/pull/21071 can handle this today (one can just add a cast extension that handles propagating specific metadata that a system knows about).

> Would a good compromise be that arrow extension type metadata specifically is wiped and comes only from the target field but other arbitrary metadata is preserved?

That would unblock my specific use of casting to a `FieldRef`. I'm also happy to PR this change.

> I think part of the problem is that metadata can serve many purposes. Extension types are just one of them.

Can you give some examples where metadata is communicating non-type information that would be useful to propagate and requires a cast and not a scalar function? In the PR that addresses this I can put them in a comment for future readers. The uses of arrow field metadata I know about are all basically trying to communicate type information or statistics, both of which can be fishy through a cast. Part of this is coming from quite a lot of previous work with GeoParquet, where we tried to communicate type information in Parquet metadata that was aggressively propagated via Arrow schema metadata and rather easily resulted in metadata that was capable of causing silently incorrect results.
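
For future readers, the compromise quoted above (wipe extension type metadata, take it only from the target field, keep everything else) could be sketched roughly as below. This is only an illustration on plain `HashMap`s, not DataFusion or arrow-rs code: `cast_output_metadata` is a hypothetical helper name, and the only thing taken from Arrow is the pair of reserved field metadata keys `ARROW:extension:name` and `ARROW:extension:metadata`. It says nothing about whether the preserved keys are still *correct* after the cast, which is the concern raised above.

```rust
use std::collections::HashMap;

/// Field metadata keys reserved for extension types by the Arrow columnar format.
const EXTENSION_KEYS: [&str; 2] = ["ARROW:extension:name", "ARROW:extension:metadata"];

/// Hypothetical helper (not a DataFusion or arrow-rs API): compute the metadata
/// of a cast's output field from the input field's metadata (`source`) and the
/// metadata on the cast's target field (`target`).
fn cast_output_metadata(
    source: &HashMap<String, String>,
    target: &HashMap<String, String>,
) -> HashMap<String, String> {
    // Start from the source metadata, minus anything describing an extension type.
    let mut out: HashMap<String, String> = source
        .iter()
        .filter(|(key, _)| !EXTENSION_KEYS.contains(&key.as_str()))
        .map(|(key, value)| (key.clone(), value.clone()))
        .collect();
    // Everything on the target field wins, including its extension type, if any.
    out.extend(target.iter().map(|(key, value)| (key.clone(), value.clone())));
    out
}
```

With this sketch, a source field tagged as an extension type and cast to a target with empty metadata loses the extension keys but keeps its other keys, while a target that names its own extension type always determines the output's extension type.
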
> But I don’t think it would be correct either for select col::text to strip arbitrary metadata

Whether it's correct or not, it is behaviour that has existed in 53 versions of DataFusion 🙂

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
