findepi commented on issue #11513:
URL: https://github.com/apache/datafusion/issues/11513#issuecomment-2370610349

   > ### Keeping track of the physical type
   > While logical types simplify the list of possible types that can be 
handled during logical planning, the relation to their underlying physical 
representation needs to be accounted for when transforming the logical plan 
into nested 
[ExecutionPlan](https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html)
 and 
[PhysicalExpr](https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.PhysicalExpr.html)
 which will dictate how will the query execute.
   
   Sorry if this is a dumb question.
   Do we actually need to track physical representation at planning time?
   
   Let's look at an example.
   Utf8, LargeUtf8, Dictionary of Utf8, Utf8View, and REE/RLE of Utf8 are 
different physical representations that all correspond to one logical string 
type (or varchar / varchar(n); it doesn't matter which). 
   Knowing at (physical) planning time that the data will be dictionary-encoded 
is useful (knowing more is always useful), but committing to it in the plan 
actually prevents run-time adaptiveness. One Parquet file can hold a column 
flat and another dict-encoded, and the Parquet writer can be adaptive, so at 
planning time there is no way to know the physical representation "for sure". 
The planner can only "propose a physical representation", and then the data 
has to be made to fit it. 
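
To make the point concrete, here is a minimal sketch of one logical string column arriving in three physical encodings. `StringColumn`, `to_logical` and `sample_files` are hypothetical names invented for this illustration; they are not Arrow or DataFusion APIs.

```rust
// Hypothetical model of one logical string column and three physical
// encodings. Illustration-only types, not Arrow or DataFusion APIs.
#[derive(Debug, Clone, PartialEq)]
enum StringColumn {
    /// One value per row.
    Flat(Vec<String>),
    /// Row i holds values[keys[i]].
    Dict { values: Vec<String>, keys: Vec<usize> },
    /// values[j] covers rows from the previous run end up to run_ends[j] (exclusive).
    RunEnd { values: Vec<String>, run_ends: Vec<usize> },
}

impl StringColumn {
    /// Decode to the logical content: one string per row.
    fn to_logical(&self) -> Vec<String> {
        match self {
            StringColumn::Flat(rows) => rows.clone(),
            StringColumn::Dict { values, keys } => {
                keys.iter().map(|&k| values[k].clone()).collect()
            }
            StringColumn::RunEnd { values, run_ends } => {
                let mut rows = Vec::new();
                let mut start = 0;
                for (value, &end) in values.iter().zip(run_ends.iter()) {
                    for _ in start..end {
                        rows.push(value.clone());
                    }
                    start = end;
                }
                rows
            }
        }
    }
}

/// Three "files" that hold the same logical rows in different encodings,
/// standing in for Parquet files written by an adaptive writer.
fn sample_files() -> Vec<StringColumn> {
    let s = |x: &str| x.to_string();
    vec![
        StringColumn::Flat(vec![s("a"), s("a"), s("b")]),
        StringColumn::Dict { values: vec![s("a"), s("b")], keys: vec![0, 0, 1] },
        StringColumn::RunEnd { values: vec![s("a"), s("b")], run_ends: vec![2, 3] },
    ]
}
```

All three columns decode to the same rows, yet a plan that pins one encoding up front would force the other two to be converted.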
   
   At what cost and for what benefit? The cost is of two kinds:
   - The system is more complex than necessary: it needs to deal with physical 
types in places where they do not matter. For example, what does it mean to 
`arrow_cast( literal, dict_type )`? What does it mean to coerce a literal to 
REE/RLE? This is what creates the requirement for `ScalarValue` to carry a 
physical type, even though for a constant scalar value this is not meaningful.
   - The data may get unnecessarily converted to a different representation 
imposed by the plan.
   
   Benefit?
   - The system could in theory be faster if we knew the physical 
representation and didn't need to type-check on it. But our functions do 
this anyway (because they are polymorphic). Since function resolution is 
logical-type based, we won't get rid of runtime type checks unless we let 
logical functions resolve to physical functions (sounds _very_ complicated, 
let's not do this).
     Not sure if this benefit exists in practice.
   - Others?
   
   Let me draw a comparison, if I may...
   Trino's internal design is IMO very well thought through. Trino has the same 
concepts of physical representations (flat, dict, rle), but they don't 
percolate into the type system. The type system is "logical types" (or just 
"types"), and more importantly the functions are authored in terms of the same 
types. The function/expression implementor doesn't need to bother with dict or 
rle inputs, because they are taken care of once and for all by the 
projection/filter operator. 
What do they miss by not dealing with "physical types" at planning time?
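
A rough sketch of that division of labour, under my own assumptions: the function is written against the logical type only, and a single projection routine deals with the encodings. `StringColumn`, `upper` and `project` are hypothetical names for this illustration; this is not Trino's or DataFusion's actual code.

```rust
// Hypothetical encodings of one logical string column (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum StringColumn {
    Flat(Vec<String>),
    Dict { values: Vec<String>, keys: Vec<usize> },
    RunEnd { values: Vec<String>, run_ends: Vec<usize> },
}

/// A scalar function authored purely in terms of the logical string type.
/// It knows nothing about dictionaries or runs.
fn upper(s: &str) -> String {
    s.to_uppercase()
}

/// The projection operator handles encodings once, for every function.
/// Assumes `f` is deterministic and row-wise, so it is enough to apply it
/// to each distinct dictionary entry or run value and keep keys/run ends.
fn project(col: &StringColumn, f: impl Fn(&str) -> String) -> StringColumn {
    let apply = |vals: &Vec<String>| -> Vec<String> {
        vals.iter().map(|s| f(s.as_str())).collect()
    };
    match col {
        StringColumn::Flat(rows) => StringColumn::Flat(apply(rows)),
        StringColumn::Dict { values, keys } => StringColumn::Dict {
            values: apply(values),
            keys: keys.clone(),
        },
        StringColumn::RunEnd { values, run_ends } => StringColumn::RunEnd {
            values: apply(values),
            run_ends: run_ends.clone(),
        },
    }
}
```

In this sketch a dict-encoded input stays dict-encoded on output and `upper` runs once per dictionary entry rather than once per row, without the function author writing any encoding-specific code.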
   
   Summing up, I propose that 
   
   - we introduce the concept of "data fusion type". This is the "logical type" 
@notfilippo proposed.
   - we use this "data fusion type" for logical plans
   - we use this "data fusion type" for physical plans as well
     - this leaves existing "physical plans" to be a runtime concept
   - we use this "data fusion type" for function authoring, 
scalar/constant/literal values in the plan
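
A very small sketch of what this split could look like, purely as an assumption for discussion: every name below (`LogicalType`, `Literal`, `Encoding`, `PlanField`, `RuntimeColumn`, `matches_plan`) is hypothetical and does not exist in DataFusion today.

```rust
/// The single type vocabulary used by logical plans, physical plans,
/// function signatures and literals (hypothetical, heavily reduced).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LogicalType {
    Boolean,
    Int64,
    String,
}

/// A plan-time constant carries a value and a logical type, nothing else.
/// There is no "dictionary literal" or "REE literal".
#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Boolean(bool),
    Int64(i64),
    String(String),
}

impl Literal {
    fn logical_type(&self) -> LogicalType {
        match self {
            Literal::Boolean(_) => LogicalType::Boolean,
            Literal::Int64(_) => LogicalType::Int64,
            Literal::String(_) => LogicalType::String,
        }
    }
}

/// Encoding is a property of a batch at run time, not of the plan.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Flat,
    Dictionary,
    RunEnd,
}

/// What a plan (logical or physical) records about a column.
struct PlanField {
    name: String,
    ty: LogicalType,
}

/// What an operator actually receives while executing.
struct RuntimeColumn {
    ty: LogicalType,
    encoding: Encoding,
}

/// A runtime column satisfies the plan if the logical types agree;
/// its encoding is free to vary from batch to batch.
fn matches_plan(field: &PlanField, col: &RuntimeColumn) -> bool {
    field.ty == col.ty
}
```

The design choice being illustrated is only that `Encoding` never appears in `PlanField` or `Literal`; how a real implementation would map these onto Arrow arrays is left open.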
   
   
   cc @notfilippo @alamb @comphead @andygrove @jayzhan211 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

