tobixdev opened a new pull request, #15106:
URL: https://github.com/apache/datafusion/pull/15106

   Related to: #14828, #12644, #14247
   
   This PR is a *rough* proposal for an implementation of user-defined sorting. 
The main goal is to get a discussion starting whether this is a direction we 
want to go.
   
   Some design considerations that help understand the PR:
   - `PhysicalSortExpr` allows defining a custom sort order, but does not know 
anything about extension types.
   - "Resolving" logical types happens during the creation of the initial 
physical plan.
   - `LogicalTypePlanningInformation` allows parameterizing this creation 
process
   - Obtaining the logical type is currently done by inspecting the `Field` 
whether there is an entry in the metadata with the key 
`EXTENSION_TYPE_NAME_KEY`.
        - An extension type registry is required for making use of this 
information.
        - Currently, this works only for direct access to columns. See #14247 
for some discussions on this issue.
        - In the future, maybe we can move `DFSchema` towards including a 
`LogicalType` and a `DataType` (at least as a first step). The remaining 
approach still works with such an assumption.
   
   This is similar to "Option 1: User defined operators" from the discussion in 
#14247. Here are some thoughts on that:
   - From a logical standpoint, the expressions themselves do not change. It's 
still a `SortExpr` just on a special type.
   - Rewriting all `Expr` to UDFs via analyzers can be also tricky as we would 
then, for example, need to parameterize `SortExpr` with a custom ordering to 
allow such a rewriting.
   - This makes the physical plan creation more involved.
   - I think this is a great avenue to making it easy to work with custom types 
in DataFusion. Currently, AFAIK, if you want to do custom sorting or use 
`AggregateExec` (which assumes your data has a natural order) for your 
extension types you basically have to re implement a custom physical node or 
resort to work-arounds that cause problems (e.g., projecting out columns that 
are sortable).
   
   There are many changes in tests etc. so here is a list of "highlights" you 
should check out when taking a look at this PR.
   Highlights:
   - `datafusion/common/src/sort.rs`: new sort data structures and adapted 
procedures from arrow-rs.
   - `datafusion/common/src/types/logical.rs`: Extension to logical type to 
provide a `LogicalTypePlanningInformation`.
   - `datafusion/core/tests/dataframe/test_types.rs`: A logical type 
(`IntOrFloatType`) that implements a custom order.
   - `datafusion/core/tests/dataframe/mod.rs`: A test for sorting a on union 
type.
   - `datafusion/core/src/datasource/mod.rs`: Applying custom order during 
physical plan creation.
   - `datafusion/core/src/physical_planner.rs`: Applying custom order during 
physical plan creation.
   
   To get (a similar) approach upstream I'd suggest the following steps 
(splitting up the PR, adding tests etc.):
   1. Add `SortOrdering` to `PhysicalSortExpr` and add support for user-defined 
sorting operations (`DynComparator` for now)
   2. Add `ExtensionTypeRegistry` to support registering extension types.
   3.1 Use said registry to look up whether we have planning information for 
`SortExpr` during planning and construct the correct `PhysicalSortExpr`.
   3.X Adapt existing code that relies on the fact that types have a natural 
ordering (e.g., `AggregateExec`). This step will be done interatively.
   
   Note that this procedure leads to a state where some parts of DF already 
make use of user defined sorting, while other do not yet support it. However, I 
don't think this is a deal breaker.
   
   I'm eager to hear your thoughts!
   
   # Further Notes
   
   I think this would also require users to return a `Field` from a UDF instead 
of just a data type (#14247) to allow UDFs to return extension types. See TODO 
note in `datafusion/core/src/physical_planner.rs`.
   
   I think supported the row-based sorting from arrow-rs can also be supported 
in a similar fashion by extending `CustomOrdering`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to