tobixdev commented on PR #18552:
URL: https://github.com/apache/datafusion/pull/18552#issuecomment-3753478776

   Thank you for finding the time to take a look at it! I appreciate it.
    
   > What is your vision about final end state? WOuld DataFusion would be 
passing around a LogicalType (rather than a `Field` as it does now)? We haven't 
even completed the transition to Field (from DataType) so that might be 
challenging to pull off.
   > [...]
   > Maybe we could add the logical type information into DFField and DFSchema 
somehow? For example, could we add the resolved LogicalType into DFField (and 
use DFField more in the codebase)?
   
   That is a great question. The approach in this PR is simply the one with the 
least amount of plumbing necessary. I have three options in my head (maybe 
there are more):
   - Directly use Field with [extension 
capabilities](https://github.com/apache/arrow-rs/issues/8730). This idea is 
more born out of the desire to not have another transition phase. The critique 
of @paleolimbot in 
https://github.com/apache/arrow-rs/issues/8730#issuecomment-3463793880 is very 
warranted and I also have my doubts whether this is a good idea.
   - Use DFField as you've mentioned. This was actually the first path I tried 
but more plumbing was necessary and for this PoC I used `FieldMetadata` 
instead. What I had in mind is that when creating the execution plan, all 
"LogicalTypes" "disapper" and the information necessary should be extracted by 
the planner. Long long time ago I made a [draft PR that implemented custom 
sorting](https://github.com/apache/datafusion/pull/15106) with a similar 
approach based on a type registry. There the `LogicalType` had a method that 
contained information on the default sort order of that type and allowed to 
rewrite the default. Based on this information, the `DefaultPhysicalPlanner` I 
then created sort exrpressions that contained "all knowledge about the custom 
order". Then it's ok for the execution plan to use `Field` instead of 
`DFField`. It's not ideal as some execution plans / optimization may reuqire 
access to logical type (I guess ?) and I would also like refer to a quote from 
@paleolimb
 ot below that provides another perspective.
   - Use a new type that can be easily obtained from a `Field` that only 
contains the "Type-related" information to avoid the issues from the quote 
below.
   
   To be honest, I don't know what's the best way forward. Maybe it's a hybrid 
between option 2) and 3) where we can easily extract the "Type-realted" 
information from the `DFField` instance.
   
   > It seems to me that to really avoid having to pass around the registry, we 
would have to have a step that "resolved types" up front (somewhat different 
than the Lazy Type Resolving principle above)
   
   Yes. I think the goto way will be having `.logical_type()?` calls that error 
when the type is not resolved. There are two ways I could think of:
   - Resolve the types when an expression is passed into a builder. What to do 
with custom-made plans?
   - Have a type resolver pass or something similar in the logical planner that 
resolves all the types at the beginning (maybe also for optimizations that 
opt-in?). The problem with this is i) effect on performance and ii) encoding 
the invariant that all types have been resolved in the type system would be 
tricky (we would need an `ResolvedExpr` or similar). The alternative is 
throwing an error at runtime as mentioned above.
   
   I have to admit though that there are some open points where I do not have a 
good answer for. I think requiring a registry everywhere, where we are handling 
`Expr` or `LogicalPlan` will be a nightmare to work with but I don't know if 
the approach above is the best one to avoid that.
   
   ### Quote Mentioned Above
   
   > The recent replacement of FieldRef for DataType is problematic in the 
other direction...rather than store too little information, it stores too much. 
The name, nullability, and non-extension metadata sometimes matter and 
sometimes don't. This usually shows up as schema mismatch errors and/or 
confusion over how exactly how to propagate the information (e.g., there are 
two PRs implementing a logical Cast in the DataFusion logical plan that do very 
different things with all three of those!). These end up affecting everybody, 
not just users of extension types.
   @paleolimbot in 
https://github.com/apache/arrow-rs/issues/8730#issuecomment-3463793880


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to