findepi commented on issue #11513: URL: https://github.com/apache/datafusion/issues/11513#issuecomment-2370610349
> ### Keeping track of the physical type
> While logical types simplify the list of possible types that can be handled during logical planning, the relation to their underlying physical representation needs to be accounted for when transforming the logical plan into nested [ExecutionPlan](https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html) and [PhysicalExpr](https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.PhysicalExpr.html) which will dictate how will the query execute.

Sorry if this is a dumb question: do we actually need to track the physical representation at planning time?

Let's look at an example. Utf8, LargeUtf8, Dictionary of Utf8, Utf8View, REE/RLE of Utf8 -- these are all different physical representations that correspond to some string logical type (or varchar / varchar(n); it doesn't matter which). Knowing at (physical) planning time that the data will be dictionary-encoded is useful (knowing more is always useful), but it actually prevents run-time adaptiveness. One Parquet file can have flat data and another dict-encoded data, and the Parquet writer can be adaptive, so at planning time there is no way to know the physical representation "for sure". The plan can only "propose a physical representation", and then the data needs to be made to fit it. At what cost, and for what benefit?

The cost is of two kinds:

- The system is more complex than necessary: it needs to deal with physical types in places where they do not matter. For example, what does it mean to `arrow_cast(literal, dict_type)`? What does it mean to coerce a literal to REE/RLE? This is what forces `ScalarValue` to be able to carry any physical type, even though for a constant scalar value this is not meaningful (the second sketch at the end of this comment illustrates the point).
- The data may get unnecessarily converted to a different representation imposed by the plan.

Benefit?

- The system could in theory be faster if we knew the physical representation and didn't need to type-check on it. But our functions do these checks anyway (because they are polymorphic). Since function resolution is based on logical types, we won't get rid of runtime type checks unless we let logical functions resolve to physical functions (that sounds _very_ complicated; let's not do it). I'm not sure this benefit exists in practice.
- Others?

Let me draw a comparison, if I may. Trino's internal design is IMO very well thought through. Trino has the same concepts of physical representations (flat, dict, RLE), but they don't percolate into the type system. The type system consists of "logical types" (or just "types"), and, more importantly, functions are authored in terms of those same types. The function/expression implementor doesn't need to bother with dict or RLE inputs, because those are taken care of once and for all by the projection/filter operator. What do they miss by not dealing with "physical types" at planning time?

Summing up, I propose that:

- we introduce the concept of a "data fusion type" -- this is the "logical type" @notfilippo proposed (the first sketch below shows one possible shape);
- we use this "data fusion type" for logical plans;
- we use this "data fusion type" for physical plans as well;
- this leaves the existing "physical types" as a runtime-only concept;
- we use this "data fusion type" for function authoring and for scalar/constant/literal values in the plan.

cc @notfilippo @alamb @comphead @andygrove @jayzhan211
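To make the "data fusion type" idea concrete, here is a minimal sketch. `LogicalType` and `logical_type` are made-up names, not an actual DataFusion API; the point is only that the many Arrow physical representations of a string fold into a single logical type:

```rust
use arrow::datatypes::DataType;

/// Hypothetical "data fusion type": one logical type per family of
/// physical representations. Illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
enum LogicalType {
    String,
    Int64,
    // ...other logical types elided
}

/// Fold the many physical (Arrow) representations into the single
/// logical type used for planning and function resolution.
fn logical_type(dt: &DataType) -> Option<LogicalType> {
    match dt {
        DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View => {
            Some(LogicalType::String)
        }
        // Dictionary and run-end encodings are just encodings of their
        // value type, so recurse into it.
        DataType::Dictionary(_, value) => logical_type(value),
        DataType::RunEndEncoded(_, values) => logical_type(values.data_type()),
        DataType::Int64 => Some(LogicalType::Int64),
        _ => None, // elided
    }
}

fn main() {
    use std::sync::Arc;
    use arrow::datatypes::Field;

    // All of these physical representations are "the" string type.
    let dict = DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    assert_eq!(logical_type(&dict), Some(LogicalType::String));
    assert_eq!(logical_type(&DataType::Utf8View), Some(LogicalType::String));

    let ree = DataType::RunEndEncoded(
        Arc::new(Field::new("run_ends", DataType::Int32, false)),
        Arc::new(Field::new("values", DataType::LargeUtf8, true)),
    );
    assert_eq!(logical_type(&ree), Some(LogicalType::String));
}
```

With a mapping like this, function resolution and coercion would only ever see `LogicalType::String`, and the flat/dict/REE choice stays a runtime property of each batch.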
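And here is a small sketch of the `ScalarValue` cost mentioned above, written against `datafusion_common` as I understand it (the equality behavior may differ between versions). The same constant, wrapped in a dictionary "physical type", becomes a different value as far as the plan is concerned:

```rust
use arrow::datatypes::DataType;
use datafusion_common::ScalarValue;

fn main() {
    // The same constant "a", once as a plain string scalar and once
    // wrapped in a dictionary physical type. Logically: one value.
    let plain = ScalarValue::Utf8(Some("a".to_string()));
    let dict = ScalarValue::Dictionary(
        Box::new(DataType::Int32),
        Box::new(ScalarValue::Utf8(Some("a".to_string()))),
    );

    // Yet they report different DataTypes and (as of the versions I
    // have looked at) do not compare equal, so every place that
    // handles scalars has to care about the physical wrapping.
    assert_ne!(plain.data_type(), dict.data_type());
    assert_ne!(plain, dict);
}
```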