Jorge created ARROW-9516:
----------------------------
Summary: [Rust][DataFusion] Refactor physical expressions to not
care about their names nor indexes
Key: ARROW-9516
URL: https://issues.apache.org/jira/browse/ARROW-9516
Project: Apache Arrow
Issue Type: Improvement
Components: Rust - DataFusion
Reporter: Jorge
This issue covers three main topics that IMO are addressed as a whole in a
refactor of the physical plans and expressions in data fusion. The underlying
issues that justify this particular ticket:
h3. We currently assign poor names to the output schema.
Specifically, most names are given based on the last expression's name.
Example: {{SELECT c, SUM(a > 2), SUM(b) FROM t GROUP BY c}} yields the fields
names "c, SUM, SUM".
h3. We currently derive the column names from physical expressions, not logical
expressions
This implies that logical expressions that perform multiple operations (e.g. an
grouped aggregation that performs partitioned aggregations + merge + final
aggregation) have their name derived from their physical declaration, not
logical. IMO a physical plan is an execution plan and is thus not concerned
with naming. It is the logical plan that should be concerned with naming.
Conceptually, a given logical plan can have more than one physical plan, e.g.
depending on the execution environment (e.g. locally vs distributed).
h3. We currently carry the index of a column read throughout the plans, making
it cumbersome to write optimizers.
More details here. In summary, it is possible to remove one of the optimizers
and significantly simplify the other if columns do not carry indexing
information.
h2. Proposal
I propose that we:
h3. drop {{physical_plan::expressions::Column::index}}
This is a major simplification of the code, and allow us to just ignore the
position of the statement on the schema, and instead focus on its name. This is
overall a simplification because it allow us to treat columns based solely on
their names, and not on their position in the schema. Since SQL does not care
about the position of the column on the table anyway (we currently already take
the first column with that name), this seems natural.
I already prototyped this
[here|https://github.com/jorgecarleitao/arrow/tree/column_names].
The main conclusion of this prototype is that this feasible as long as all our
expressions get assigned a unique name, which is against what we currently
offer (see example above). This leads me to:
h3. drop {{physical_plan::PhysicalExpr::name()}}
Currently, the name of an expression is derived from its physical plan.
However, some operations' names are required to be known before its physical
representation. The example I found in our current code is the grouped
aggregation described above. If we were to build the name of our aggregation
based on its physical plan, the name of a "COUNT(a)" operation would be
{{SUM(COUNT(a))}} because, in the physical plan we first count on each
partition, then merge, and them sum the counts over all partitions.
Fundamentally, IMO the issue here is that we are mixing responsibilities: the
physical plan should not care about naming, because the physical plan
corresponds to an execution plan, not a logical description of the column (its
name). This leads me to:
h3. add {{logicalplan::Expr::name()}}
This will contain the name of this expression, that will naturally depend on
the variation. Its implementation will be based on our current code for
{{physical_plan::PhysicalExpr::name()}}.
I can take this work, but before committing, would like to know your thoughts
about this. My initial prototyping indicate that all of this is possible and
greatly simplifies the code, but I may be missing a design aspect of this.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)