tobixdev opened a new pull request, #15106: URL: https://github.com/apache/datafusion/pull/15106
Related to: #14828, #12644, #14247 This PR is a *rough* proposal for an implementation of user-defined sorting. The main goal is to get a discussion starting whether this is a direction we want to go. Some design considerations that help understand the PR: - `PhysicalSortExpr` allows defining a custom sort order, but does not know anything about extension types. - "Resolving" logical types happens during the creation of the initial physical plan. - `LogicalTypePlanningInformation` allows parameterizing this creation process - Obtaining the logical type is currently done by inspecting the `Field` whether there is an entry in the metadata with the key `EXTENSION_TYPE_NAME_KEY`. - An extension type registry is required for making use of this information. - Currently, this works only for direct access to columns. See #14247 for some discussions on this issue. - In the future, maybe we can move `DFSchema` towards including a `LogicalType` and a `DataType` (at least as a first step). The remaining approach still works with such an assumption. This is similar to "Option 1: User defined operators" from the discussion in #14247. Here are some thoughts on that: - From a logical standpoint, the expressions themselves do not change. It's still a `SortExpr` just on a special type. - Rewriting all `Expr` to UDFs via analyzers can be also tricky as we would then, for example, need to parameterize `SortExpr` with a custom ordering to allow such a rewriting. - This makes the physical plan creation more involved. - I think this is a great avenue to making it easy to work with custom types in DataFusion. Currently, AFAIK, if you want to do custom sorting or use `AggregateExec` (which assumes your data has a natural order) for your extension types you basically have to re implement a custom physical node or resort to work-arounds that cause problems (e.g., projecting out columns that are sortable). There are many changes in tests etc. so here is a list of "highlights" you should check out when taking a look at this PR. Highlights: - `datafusion/common/src/sort.rs`: new sort data structures and adapted procedures from arrow-rs. - `datafusion/common/src/types/logical.rs`: Extension to logical type to provide a `LogicalTypePlanningInformation`. - `datafusion/core/tests/dataframe/test_types.rs`: A logical type (`IntOrFloatType`) that implements a custom order. - `datafusion/core/tests/dataframe/mod.rs`: A test for sorting a on union type. - `datafusion/core/src/datasource/mod.rs`: Applying custom order during physical plan creation. - `datafusion/core/src/physical_planner.rs`: Applying custom order during physical plan creation. To get (a similar) approach upstream I'd suggest the following steps (splitting up the PR, adding tests etc.): 1. Add `SortOrdering` to `PhysicalSortExpr` and add support for user-defined sorting operations (`DynComparator` for now) 2. Add `ExtensionTypeRegistry` to support registering extension types. 3.1 Use said registry to look up whether we have planning information for `SortExpr` during planning and construct the correct `PhysicalSortExpr`. 3.X Adapt existing code that relies on the fact that types have a natural ordering (e.g., `AggregateExec`). This step will be done interatively. Note that this procedure leads to a state where some parts of DF already make use of user defined sorting, while other do not yet support it. However, I don't think this is a deal breaker. I'm eager to hear your thoughts! # Further Notes I think this would also require users to return a `Field` from a UDF instead of just a data type (#14247) to allow UDFs to return extension types. See TODO note in `datafusion/core/src/physical_planner.rs`. I think supported the row-based sorting from arrow-rs can also be supported in a similar fashion by extending `CustomOrdering`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org