paleolimbot commented on issue #8730: URL: https://github.com/apache/arrow-rs/issues/8730#issuecomment-3463793880
I think the core of the issue is that almost everybody who types `DataType` or `ArrayRef` intends for this concept to capture both the logical and physical data type, which is an easy mistake to make given that the enum includes Timestamp, Duration, and Interval and the meaning of the `DataType` structures in Spark/Arrow C++/go/pyarrow/Java. For SedonaDB, the appearance of DataType or ArrayRef in arrow-rs or DataFusion is usually a bug, a feature I can't use, or something I have to rewrite (some recent rewrites include the table formatter and the Python UDF framework, which both build on non-extensible arrow-rs APIs built on the DataType and ArrayRef). The places where arrow kernels focus only on the physical data layout are places where it is very easy for DataFusion to introduce bugs for us: when `+`, `-` or cast to operate only on Physical + Datetime portion of the type these are correctness errors for our engine. The recent replacement of FieldRef for DataType is problematic in the other direction...rather than store too little information, it stores too much. The name, nullability, and non-extension metadata sometimes matter and sometimes don't. This usually shows up as schema mismatch errors and/or confusion over how exactly how to propagate the information (e.g., there are two PRs implementing a logical Cast in the DataFusion logical plan that do very different things with all three of those!). These end up affecting everybody, not just users of extension types. Skirting the issue of whether the current options is philosophically correct, it is very easy to misuse and means everybody who wants to support extension types has to reinvent it all. I would much rather work to improve the printing of tables and fidelity of pyarrow type conversion in arrow-rs than SedonaDB. I have to admit that the `FieldType` API doesn't spark much joy for me, nor am I sure whether the propagation of the `extensions` member from Field instance to Field instance will happen in a sufficient number of places to be useful (probably an extension collection in our SessionContext or ConfigOptions would have a better chance of being applied consistently to all fields). I wonder if implementing specific extendable functionality may be a way to start. The problem of supporting (combinations of) extension types in the pretty printing of batches (including dictionaries, lists, and structs that contain them) highlights a few of these issues, as does `Schema::try_merge()` (variant in particular will be an issue for that one). I think those have broad applicability beyond DataFusion although DataFusion would certainly beneft. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
