paleolimbot commented on issue #8730:
URL: https://github.com/apache/arrow-rs/issues/8730#issuecomment-3463793880

   I think the core of the issue is that almost everybody who types `DataType` 
or `ArrayRef` intends for this concept to capture both the logical and physical 
data type, which is an easy mistake to make given that the enum includes 
Timestamp, Duration, and Interval and the meaning of the `DataType` structures 
in Spark/Arrow C++/go/pyarrow/Java.
   
   For SedonaDB, the appearance of DataType or ArrayRef in arrow-rs or 
DataFusion is usually a bug, a feature I can't use, or something I have to 
rewrite (some recent rewrites include the table formatter and the Python UDF 
framework, which both build on non-extensible arrow-rs APIs built on the 
DataType and ArrayRef). The places where arrow kernels focus only on the 
physical data layout are places where it is very easy for DataFusion to 
introduce bugs for us: when `+`, `-` or cast to operate only on Physical + 
Datetime portion of the type these are correctness errors for our engine.
   
   The recent replacement of FieldRef for DataType is problematic in the other 
direction...rather than store too little information, it stores too much. The 
name, nullability, and non-extension metadata sometimes matter and sometimes 
don't. This usually shows up as schema mismatch errors and/or confusion over 
how exactly how to propagate the information (e.g., there are two PRs 
implementing a logical Cast in the DataFusion logical plan that do very 
different things with all three of those!). These end up affecting everybody, 
not just users of extension types.
   
   Skirting the issue of whether the current options is philosophically 
correct, it is very easy to misuse and means everybody who wants to support 
extension types has to reinvent it all. I would much rather work to improve the 
printing of tables and fidelity of pyarrow type conversion in arrow-rs than 
SedonaDB.
   
   I have to admit that the `FieldType` API doesn't spark much joy for me, nor 
am I sure whether the propagation of the `extensions` member from Field 
instance to Field instance will happen in a sufficient number of places to be 
useful (probably an extension collection in our SessionContext or ConfigOptions 
would have a better chance of being applied consistently to all fields).
   
   I wonder if implementing specific extendable functionality may be a way to 
start. The problem of supporting (combinations of) extension types in the 
pretty printing of batches (including dictionaries, lists, and structs that 
contain them) highlights a few of these issues, as does `Schema::try_merge()` 
(variant in particular will be an issue for that one). I think those have broad 
applicability beyond DataFusion although DataFusion would certainly beneft.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to