tobixdev commented on PR #18552:
URL: https://github.com/apache/datafusion/pull/18552#issuecomment-3753478776
Thank you for finding the time to take a look at it! I appreciate it.
> What is your vision about final end state? WOuld DataFusion would be
passing around a LogicalType (rather than a `Field` as it does now)? We haven't
even completed the transition to Field (from DataType) so that might be
challenging to pull off.
> [...]
> Maybe we could add the logical type information into DFField and DFSchema
somehow? For example, could we add the resolved LogicalType into DFField (and
use DFField more in the codebase)?
That is a great question. The approach in this PR is simply the one with the
least amount of plumbing necessary. I have three options in my head (maybe
there are more):
- Directly use Field with [extension
capabilities](https://github.com/apache/arrow-rs/issues/8730). This idea is
more born out of the desire to not have another transition phase. The critique
of @paleolimbot in
https://github.com/apache/arrow-rs/issues/8730#issuecomment-3463793880 is very
warranted and I also have my doubts whether this is a good idea.
- Use DFField as you've mentioned. This was actually the first path I tried
but more plumbing was necessary and for this PoC I used `FieldMetadata`
instead. What I had in mind is that when creating the execution plan, all
"LogicalTypes" "disapper" and the information necessary should be extracted by
the planner. Long long time ago I made a [draft PR that implemented custom
sorting](https://github.com/apache/datafusion/pull/15106) with a similar
approach based on a type registry. There the `LogicalType` had a method that
contained information on the default sort order of that type and allowed to
rewrite the default. Based on this information, the `DefaultPhysicalPlanner` I
then created sort exrpressions that contained "all knowledge about the custom
order". Then it's ok for the execution plan to use `Field` instead of
`DFField`. It's not ideal as some execution plans / optimization may reuqire
access to logical type (I guess ?) and I would also like refer to a quote from
@paleolimb
ot below that provides another perspective.
- Use a new type that can be easily obtained from a `Field` that only
contains the "Type-related" information to avoid the issues from the quote
below.
To be honest, I don't know what's the best way forward. Maybe it's a hybrid
between option 2) and 3) where we can easily extract the "Type-realted"
information from the `DFField` instance.
> It seems to me that to really avoid having to pass around the registry, we
would have to have a step that "resolved types" up front (somewhat different
than the Lazy Type Resolving principle above)
Yes. I think the goto way will be having `.logical_type()?` calls that error
when the type is not resolved. There are two ways I could think of:
- Resolve the types when an expression is passed into a builder. What to do
with custom-made plans?
- Have a type resolver pass or something similar in the logical planner that
resolves all the types at the beginning (maybe also for optimizations that
opt-in?). The problem with this is i) effect on performance and ii) encoding
the invariant that all types have been resolved in the type system would be
tricky (we would need an `ResolvedExpr` or similar). The alternative is
throwing an error at runtime as mentioned above.
I have to admit though that there are some open points where I do not have a
good answer for. I think requiring a registry everywhere, where we are handling
`Expr` or `LogicalPlan` will be a nightmare to work with but I don't know if
the approach above is the best one to avoid that.
### Quote Mentioned Above
> The recent replacement of FieldRef for DataType is problematic in the
other direction...rather than store too little information, it stores too much.
The name, nullability, and non-extension metadata sometimes matter and
sometimes don't. This usually shows up as schema mismatch errors and/or
confusion over how exactly how to propagate the information (e.g., there are
two PRs implementing a logical Cast in the DataFusion logical plan that do very
different things with all three of those!). These end up affecting everybody,
not just users of extension types.
@paleolimbot in
https://github.com/apache/arrow-rs/issues/8730#issuecomment-3463793880
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]