tobixdev commented on issue #18223:
URL: https://github.com/apache/datafusion/issues/18223#issuecomment-3471972334

   A reply to 
phttps://github.com/apache/arrow-rs/issues/8730#issuecomment-3463793880 as some 
of the issues seem to be more fitting to this issue.
   
   Firstly, thank you very much for your input!
   
   > I think the core of the issue is that almost everybody who types DataType 
or ArrayRef intends for this concept to capture both the logical and physical 
data type, which is an easy mistake to make given that the enum includes 
Timestamp, Duration, and Interval and the meaning of the DataType structures in 
Spark/Arrow C++/go/pyarrow/Java.
   >
   > For SedonaDB, the appearance of DataType or ArrayRef in arrow-rs or 
DataFusion is usually a bug, a feature I can't use, or something I have to 
rewrite (some recent rewrites include the table formatter and the Python UDF 
framework, which both build on non-extensible arrow-rs APIs built on the 
DataType and ArrayRef). The places where arrow kernels focus only on the 
physical data layout are places where it is very easy for DataFusion to 
introduce bugs for us: when +, - or cast to operate only on Physical + Datetime 
portion of the type these are correctness errors for our engine.
   > 
   > The recent replacement of FieldRef for DataType is problematic in the 
other direction...rather than store too little information, it stores too much. 
The name, nullability, and non-extension metadata sometimes matter and 
sometimes don't. This usually shows up as schema mismatch errors and/or 
confusion over how exactly how to propagate the information (e.g., there are 
two PRs implementing a logical Cast in the DataFusion logical plan that do very 
different things with all three of those!). These end up affecting everybody, 
not just users of extension types.
   
   That's a very interesting point! Would using the existing [LogicalType] 
instead of `Field` resolve this problem? There we don't distinguish between, 
for example, `Utf8` and `Utf8View`.
   
   > [...] nor am I sure whether the propagation of the extensions member from 
Field instance to Field instance will happen in a sufficient number of places 
to be useful [...]
   
   I think this is a fair critique point here. IMO, at some utopian future, 
having validations in-place that check whether the stored logical type matches 
the type metadata could prevent many issues. Nevertheless, this will make 
problems until the usage of logical types is applied consistently. Maybe some 
workarounds such as an initial "optimizer pass" that sets the correct logical 
types can help to ease the migration.
   
   > [...] (probably an extension collection in our SessionContext or 
ConfigOptions would have a better chance of being applied consistently to all 
fields).
   
   I think the extension collection in `SessionContext` will be central to any 
design of extension types. What I am exploring is that you can, in addition to 
the collection in `SessionContext`, pass around the reference to the logical 
type within `Expr`s and `Field`s etc. I guess this is similar to how we have a 
central UDF registry but also carry the reference to the appropriate UDF for 
each `Expr::ScalarFunction` which allows us to work with the scalar function 
call (e.g., printing) without referencing the registry again.
   
   > I wonder if implementing specific extendable functionality may be a way to 
start. The problem of supporting (combinations of) extension types in the 
pretty printing of batches (including dictionaries, lists, and structs that 
contain them) highlights a few of these issues, as does Schema::try_merge() 
(variant in particular will be an issue for that one). I think those have broad 
applicability beyond DataFusion although DataFusion would certainly beneft.
   
   In my head, the idea is that the logical type reference will provide access 
to these extensions. For example, the `PrettyPrintExtension` from above can be 
retrieved from a particular `LogicalType` and then be used to print `Array` 
values and `ScalarValue`s. Then there would be two ways of retrieving the 
extension: Use a "regular" `Field` (or similar), obtain the logical type from 
the registry and call the printer (this would always be an "escape hatch" 
option) or if the pointer to the `LogicalType` is stored inline, skip the call 
to the registry.
   
   To be concrete there is some design sketch below. Please point out any flaws 
in that design if you catch anything. I am really unsure whether this can cater 
to all the needs.
   
   Extend the `LogicalType`:
   
   ```rust
   pub trait LogicalType: {
       // ....
   
       /// Returns a reference to the pretty printer for this type.
       fn pretty_printer(&self) -> &dyn ValuePrettyPrinter;
   
      // Other extensions will follow (e.g., built-in operators, comparisons, 
sorting, etc.)
   }
   ```
   
   The printer extensions (the function signatures are subject to change):
   
   ```rust
   /// Implements pretty printing for a set of types.
   ///
   /// For example, the default pretty-printer for a byte array might not be 
adequate for a UUID type,
   /// which is physically stored as a fixed-length byte array. This extension 
allows the user to
   /// override the default pretty-printer for a given type.
   pub trait ValuePrettyPrinter {
       /// Pretty print a scalar value.
       ///
       /// # Error
       ///
       /// Will return an error if the given `value` is not supported by this 
pretty printer.
       fn pretty_print_scalar(&self, value: &ScalarValue) -> Result<String>;
   
       /// Pretty print a specific value of a given array.
       ///
       /// # Error
       ///
       /// Will return an error if the `array` value at `index` is not 
supported by this pretty printer.
       fn pretty_print_array(&self, array: &dyn Array, index: usize) -> 
Result<String> {
           let value = ScalarValue::try_from_array(array, index)?;
           self.pretty_print_scalar(&value)
       }
   }
   ```
   
   And then in `impl Display for Expr`:
   
   ```rust
   Expr::Literal(v, metadata) => {
       let logical_type = metadata.logical_type(); // Assumes that this is now 
stored somewhere in-line. 
       let pretty_printer = logical_type.pretty_printer(); // Will return a 
default implementation for `NativeType`.
       let v = pretty_printer.pretty_print_scalar(v);
   
       match metadata.as_ref().map(|m| m.is_empty()).unwrap_or(true) {
            false => write!(f, "{v} {:?}", metadata.as_ref().unwrap()),
           true => write!(f, "{v}"),
       }
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to