findepi commented on issue #11513:
URL: https://github.com/apache/datafusion/issues/11513#issuecomment-2242825107

   I like the idea of separating logical types from arrow types, but it would 
be great to understand the exact consequences. 
   DataFusion is both a SQL execution engine (so it has a SQL frontend) and a query execution library (so it has a rich programmable API).
   
   The SQL frontend should have "very logical types". For example, we don't 
need `Time32(unit)` and `Time64(unit)`. The SQL user isn't interested in how 
different precisions are implemented. Having `Time(unit)` should be enough (and 
different units could use different physical representations under the hood).
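   A minimal sketch of what that could look like (the `LogicalTime` type below is hypothetical, not an existing DataFusion API; it just maps a single logical `Time(unit)` onto Arrow's two physical time types):

```rust
use arrow::datatypes::{DataType, TimeUnit};

/// Hypothetical logical `Time(unit)` type: SQL only sees the unit, and the
/// physical Arrow representation is chosen under the hood.
struct LogicalTime {
    unit: TimeUnit,
}

impl LogicalTime {
    /// Coarse units fit in 32 bits, fine units need 64 bits; the SQL user
    /// never has to care about this split.
    fn physical_type(&self) -> DataType {
        match self.unit {
            TimeUnit::Second | TimeUnit::Millisecond => DataType::Time32(self.unit),
            TimeUnit::Microsecond | TimeUnit::Nanosecond => DataType::Time64(self.unit),
        }
    }
}
```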
   Also, `Timestamp(unit, tz)` may want to be revisited from the SQL perspective. The SQL spec defines two separate types, "timestamp" (without zone) and "timestamp with time zone" (point-in-time semantics + zone information). It might be possible (with some limitations) to have both use the same physical representation (like arrow `Timestamp(unit, tz)`), but logically they want to be distinct.
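   Again as a purely hypothetical sketch, the two SQL timestamp types could be distinct logical types that happen to share Arrow's `Timestamp(unit, tz)` physical representation:

```rust
use arrow::datatypes::{DataType, TimeUnit};

/// Hypothetical logical timestamp types following the SQL spec: distinct at
/// the logical level, even though they can share one physical encoding.
enum LogicalTimestamp {
    /// SQL "timestamp" (without zone): a wall-clock reading.
    WithoutTimeZone { unit: TimeUnit },
    /// SQL "timestamp with time zone": a point in time plus zone information.
    WithTimeZone { unit: TimeUnit, tz: String },
}

impl LogicalTimestamp {
    /// Both logical types could be backed by arrow `Timestamp(unit, tz)`.
    fn physical_type(&self) -> DataType {
        match self {
            LogicalTimestamp::WithoutTimeZone { unit } => DataType::Timestamp(*unit, None),
            LogicalTimestamp::WithTimeZone { unit, tz } => {
                DataType::Timestamp(*unit, Some(tz.clone().into()))
            }
        }
    }
}
```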
   
   Then, from the DF-as-a-library perspective, physical representation becomes important.
   To drive the design for this part, we need to understand how DF-as-a-library is used: what are its necessary contractual obligations, and what can be an implementation detail? At this level we need to be even more extensible, since adding more types into the type system feels very natural for the DF-as-a-library use-case. We also may need to be more physical, like perhaps discerning between time32 and time64.
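   For illustration only, such an extension point could be as small as a trait that names the logical type and declares its canonical physical representation (the trait and the `uuid` example below are made up for this comment, not a concrete proposal):

```rust
use arrow::datatypes::DataType;

/// Hypothetical extension point: a library user introduces a new logical type
/// by naming it and declaring the physical Arrow type that backs it.
trait UserDefinedLogicalType {
    /// Stable name used for planning, coercion rules, and error messages.
    fn name(&self) -> &str;
    /// Canonical physical representation the engine can materialize to.
    fn canonical_physical_type(&self) -> DataType;
}

/// Example: a UUID logical type stored as 16 fixed-size bytes.
struct UuidType;

impl UserDefinedLogicalType for UuidType {
    fn name(&self) -> &str {
        "uuid"
    }
    fn canonical_physical_type(&self) -> DataType {
        DataType::FixedSizeBinary(16)
    }
}
```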
   It would be great if, as proposed here, this layer still didn't have to deal with the various ways of encoding semantically equivalent data. I.e., the DF-as-a-library use-case still doesn't want to discern between `Utf8`, `Dictionary(Utf8)`, and `RunEndEncoded(Utf8)`. This shouldn't be in the type system at all. A function operating on string values should work with any valid representation of string values (either intrinsically, or with an adapter).
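   A hedged sketch of that adapter idea, written against arrow-rs directly (`upper_any_string` is a made-up function, not an existing kernel): it keeps a fast path for plain `Utf8` and falls back to materializing other string encodings via `cast`.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

/// Uppercase a string column regardless of its physical encoding.
fn upper_any_string(array: &ArrayRef) -> Result<ArrayRef, ArrowError> {
    let utf8: ArrayRef = match array.data_type() {
        // Fast path: already plain Utf8, nothing to convert.
        DataType::Utf8 => Arc::clone(array),
        // Adapter path: Dictionary(Utf8), RunEndEncoded(Utf8), ... carry the
        // same string values; materialize them to plain Utf8 first.
        _ => cast(array.as_ref(), &DataType::Utf8)?,
    };
    let strings = utf8
        .as_ref()
        .as_any()
        .downcast_ref::<StringArray>()
        .expect("cast to Utf8 yields a StringArray");
    let result: StringArray = strings
        .iter()
        .map(|v| v.map(|s| s.to_uppercase()))
        .collect();
    Ok(Arc::new(result))
}
```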
   
   > [ColumnarValue](https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.ColumnarValue.html#) enum could be extended so that functions could choose to provide their own optimised implementation for a subset of physical types and then fall back to a generic implementation that materialises the argument to known physical type. This would potentially allow native functions to support user defined physical types that map to known logical types.
   
   I am concerned that deriving support for a logical type from support for a physical type is actually a slippery slope.
   
   Let's consider an example. Assume I have a `my_duration({micros | nanos})` type which uses a 64-bit integer physical type for its representation. `my_duration` values are always stored with nanosecond precision, and the unit defines how aligned they have to be. I.e., 1s is always stored as the integer value 1_000_000_000 for both allowed precisions. Every value of `my_duration(micros)` is _represented as_ an i64 number divisible by 1000.
   
   Now, I have an add_one() function that takes 64-bit integer values and adds +1 to them. The +1 operation is perfectly valid for i64 -- it's valid for the SQL long/bigint type. It's also valid for my_duration(nanos), but it's not valid for my_duration(micros), since it produces an unaligned value (not divisible by 1000).
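   To make that concrete, here is a small self-contained sketch (all names hypothetical):

```rust
/// Hypothetical `my_duration` type from the example: values are always stored
/// as nanoseconds in an i64, and the declared unit only constrains how
/// aligned the stored value must be.
#[derive(Clone, Copy)]
enum MyDurationUnit {
    Micros, // stored value must be divisible by 1_000
    Nanos,  // any i64 is a valid stored value
}

fn is_valid_my_duration(unit: MyDurationUnit, stored_nanos: i64) -> bool {
    match unit {
        MyDurationUnit::Micros => stored_nanos % 1_000 == 0,
        MyDurationUnit::Nanos => true,
    }
}

/// `add_one` is perfectly fine on the physical type (i64)...
fn add_one(v: i64) -> i64 {
    v + 1
}

fn main() {
    let one_second = 1_000_000_000_i64; // 1s, valid for both units

    // ...but applying it blindly to the physical representation breaks the
    // logical invariant of `my_duration(micros)`:
    let bumped = add_one(one_second);
    assert!(is_valid_my_duration(MyDurationUnit::Nanos, bumped)); // still fine
    assert!(!is_valid_my_duration(MyDurationUnit::Micros, bumped)); // no longer aligned
}
```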
   
   
   

