findepi commented on issue #11513: URL: https://github.com/apache/datafusion/issues/11513#issuecomment-2370610349
> ### Keeping track of the physical type
> While logical types simplify the list of possible types that can be handled during logical planning, the relation to their underlying physical representation needs to be accounted for when transforming the logical plan into nested [ExecutionPlan](https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html) and [PhysicalExpr](https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.PhysicalExpr.html) which will dictate how will the query execute.

Sorry if this is a dumb question: do we actually need to track the physical representation at planning time?

Let's look at an example. Utf8, LargeUtf8, Dictionary of Utf8, Utf8View, REE/RLE of Utf8 -- these are all different physical representations that correspond to some string logical type (or varchar / varchar(n); it doesn't matter which). Knowing at (physical) planning time that the data will be dictionary-encoded is useful (knowing more is always useful), but it actually prevents run-time adaptiveness. One Parquet file can have flat data and another dict-encoded data, and the Parquet writer can be adaptive, so at planning time there is no way to know the physical representation "for sure". The plan can only "propose a physical representation", and then the data needs to be made to fit it. At what cost, and for what benefit?

The cost is of two kinds:

- The system is more complex than necessary: it needs to deal with physical types in places where they do not matter. For example, what does it mean to `arrow_cast(literal, dict_type)`? What does it mean to coerce a literal to REE/RLE? This is what forces `ScalarValue` to be able to carry any physical type, even though for a constant scalar value this is not meaningful (the second sketch at the end of this comment illustrates the point).
- The data may get unnecessarily converted to a different representation imposed by the plan.

Benefit?

- The system could in theory be faster if we knew the physical representation and didn't need to type-check on it. But our functions do these checks anyway (because they are polymorphic). Since function resolution is based on logical types, we won't get rid of runtime type checks unless we let logical functions resolve to physical functions (that sounds _very_ complicated; let's not do it). I'm not sure this benefit exists in practice.
- Others?

Let me draw a comparison, if I may. Trino's internal design is IMO very well thought through. Trino has the same concepts of physical representations (flat, dict, RLE), but they don't percolate into the type system. The type system consists of "logical types" (or just "types"), and, more importantly, functions are authored in terms of those same types. The function/expression implementor doesn't need to bother with dict or RLE inputs, because those are taken care of once and for all by the projection/filter operator. What do they miss by not dealing with "physical types" at planning time?

Summing up, I propose that:

- we introduce the concept of a "data fusion type" -- this is the "logical type" @notfilippo proposed (the first sketch below shows one possible shape);
- we use this "data fusion type" for logical plans;
- we use this "data fusion type" for physical plans as well;
- this leaves the existing "physical types" as a runtime-only concept;
- we use this "data fusion type" for function authoring and for scalar/constant/literal values in the plan.

cc @notfilippo @alamb @comphead @andygrove @jayzhan211
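To make the "data fusion type" idea concrete, here is a minimal sketch. `LogicalType` and `logical_type` are made-up names, not an actual DataFusion API; the point is only that the many Arrow physical representations of a string fold into a single logical type:

```rust
use arrow::datatypes::DataType;

/// Hypothetical "data fusion type": one logical type per family of
/// physical representations. Illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
enum LogicalType {
    String,
    Int64,
    // ...other logical types elided
}

/// Fold the many physical (Arrow) representations into the single
/// logical type used for planning and function resolution.
fn logical_type(dt: &DataType) -> Option<LogicalType> {
    match dt {
        DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View => {
            Some(LogicalType::String)
        }
        // Dictionary and run-end encodings are just encodings of their
        // value type, so recurse into it.
        DataType::Dictionary(_, value) => logical_type(value),
        DataType::RunEndEncoded(_, values) => logical_type(values.data_type()),
        DataType::Int64 => Some(LogicalType::Int64),
        _ => None, // elided
    }
}

fn main() {
    use std::sync::Arc;
    use arrow::datatypes::Field;

    // All of these physical representations are "the" string type.
    let dict = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    assert_eq!(logical_type(&dict), Some(LogicalType::String));
    assert_eq!(logical_type(&DataType::Utf8View), Some(LogicalType::String));

    let ree = DataType::RunEndEncoded(
        Arc::new(Field::new("run_ends", DataType::Int32, false)),
        Arc::new(Field::new("values", DataType::LargeUtf8, true)),
    );
    assert_eq!(logical_type(&ree), Some(LogicalType::String));
}
```

With a mapping like this, function resolution and coercion would only ever see `LogicalType::String`, and the flat/dict/REE choice stays a runtime property of each batch.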
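And here is a small sketch of the `ScalarValue` cost mentioned above, written against `datafusion_common` as I understand it (the equality behavior may differ between versions). The same constant, wrapped in a dictionary "physical type", becomes a different value as far as the plan is concerned:

```rust
use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;

fn main() {
    // The same constant "a", once as a plain string scalar and once
    // wrapped in a dictionary physical type. Logically: one value.
    let plain = ScalarValue::Utf8(Some("a".to_string()));
    let dict = ScalarValue::Dictionary(
        Box::new(DataType::Int32),
        Box::new(ScalarValue::Utf8(Some("a".to_string()))),
    );

    // Yet they report different DataTypes and (as of the versions I
    // have looked at) do not compare equal, so every place that
    // handles scalars has to care about the physical wrapping.
    assert_ne!(plain.data_type(), dict.data_type());
    assert_ne!(plain, dict);
}
```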