tobixdev commented on issue #18223: URL: https://github.com/apache/datafusion/issues/18223#issuecomment-3471972334
A reply to phttps://github.com/apache/arrow-rs/issues/8730#issuecomment-3463793880 as some of the issues seem to be more fitting to this issue. Firstly, thank you very much for your input! > I think the core of the issue is that almost everybody who types DataType or ArrayRef intends for this concept to capture both the logical and physical data type, which is an easy mistake to make given that the enum includes Timestamp, Duration, and Interval and the meaning of the DataType structures in Spark/Arrow C++/go/pyarrow/Java. > > For SedonaDB, the appearance of DataType or ArrayRef in arrow-rs or DataFusion is usually a bug, a feature I can't use, or something I have to rewrite (some recent rewrites include the table formatter and the Python UDF framework, which both build on non-extensible arrow-rs APIs built on the DataType and ArrayRef). The places where arrow kernels focus only on the physical data layout are places where it is very easy for DataFusion to introduce bugs for us: when +, - or cast to operate only on Physical + Datetime portion of the type these are correctness errors for our engine. > > The recent replacement of FieldRef for DataType is problematic in the other direction...rather than store too little information, it stores too much. The name, nullability, and non-extension metadata sometimes matter and sometimes don't. This usually shows up as schema mismatch errors and/or confusion over how exactly how to propagate the information (e.g., there are two PRs implementing a logical Cast in the DataFusion logical plan that do very different things with all three of those!). These end up affecting everybody, not just users of extension types. That's a very interesting point! Would using the existing [LogicalType] instead of `Field` resolve this problem? There we don't distinguish between, for example, `Utf8` and `Utf8View`. > [...] nor am I sure whether the propagation of the extensions member from Field instance to Field instance will happen in a sufficient number of places to be useful [...] I think this is a fair critique point here. IMO, at some utopian future, having validations in-place that check whether the stored logical type matches the type metadata could prevent many issues. Nevertheless, this will make problems until the usage of logical types is applied consistently. Maybe some workarounds such as an initial "optimizer pass" that sets the correct logical types can help to ease the migration. > [...] (probably an extension collection in our SessionContext or ConfigOptions would have a better chance of being applied consistently to all fields). I think the extension collection in `SessionContext` will be central to any design of extension types. What I am exploring is that you can, in addition to the collection in `SessionContext`, pass around the reference to the logical type within `Expr`s and `Field`s etc. I guess this is similar to how we have a central UDF registry but also carry the reference to the appropriate UDF for each `Expr::ScalarFunction` which allows us to work with the scalar function call (e.g., printing) without referencing the registry again. > I wonder if implementing specific extendable functionality may be a way to start. The problem of supporting (combinations of) extension types in the pretty printing of batches (including dictionaries, lists, and structs that contain them) highlights a few of these issues, as does Schema::try_merge() (variant in particular will be an issue for that one). I think those have broad applicability beyond DataFusion although DataFusion would certainly beneft. In my head, the idea is that the logical type reference will provide access to these extensions. For example, the `PrettyPrintExtension` from above can be retrieved from a particular `LogicalType` and then be used to print `Array` values and `ScalarValue`s. Then there would be two ways of retrieving the extension: Use a "regular" `Field` (or similar), obtain the logical type from the registry and call the printer (this would always be an "escape hatch" option) or if the pointer to the `LogicalType` is stored inline, skip the call to the registry. To be concrete there is some design sketch below. Please point out any flaws in that design if you catch anything. I am really unsure whether this can cater to all the needs. Extend the `LogicalType`: ```rust pub trait LogicalType: { // .... /// Returns a reference to the pretty printer for this type. fn pretty_printer(&self) -> &dyn ValuePrettyPrinter; // Other extensions will follow (e.g., built-in operators, comparisons, sorting, etc.) } ``` The printer extensions (the function signatures are subject to change): ```rust /// Implements pretty printing for a set of types. /// /// For example, the default pretty-printer for a byte array might not be adequate for a UUID type, /// which is physically stored as a fixed-length byte array. This extension allows the user to /// override the default pretty-printer for a given type. pub trait ValuePrettyPrinter { /// Pretty print a scalar value. /// /// # Error /// /// Will return an error if the given `value` is not supported by this pretty printer. fn pretty_print_scalar(&self, value: &ScalarValue) -> Result<String>; /// Pretty print a specific value of a given array. /// /// # Error /// /// Will return an error if the `array` value at `index` is not supported by this pretty printer. fn pretty_print_array(&self, array: &dyn Array, index: usize) -> Result<String> { let value = ScalarValue::try_from_array(array, index)?; self.pretty_print_scalar(&value) } } ``` And then in `impl Display for Expr`: ```rust Expr::Literal(v, metadata) => { let logical_type = metadata.logical_type(); // Assumes that this is now stored somewhere in-line. let pretty_printer = logical_type.pretty_printer(); // Will return a default implementation for `NativeType`. let v = pretty_printer.pretty_print_scalar(v); match metadata.as_ref().map(|m| m.is_empty()).unwrap_or(true) { false => write!(f, "{v} {:?}", metadata.as_ref().unwrap()), true => write!(f, "{v}"), } } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
