Thanks for the detailed write up. That all makes good sense to me. I am not
sure that I had a good reason for having physical expressions determine
their names.

On Fri, Jul 17, 2020, 12:50 PM Jorge (Jira) <j...@apache.org> wrote:

> Jorge created ARROW-9516:
> ----------------------------
>
>              Summary: [Rust][DataFusion] Refactor physical expressions to
> not care about their names nor indexes
>                  Key: ARROW-9516
>                  URL: https://issues.apache.org/jira/browse/ARROW-9516
>              Project: Apache Arrow
>           Issue Type: Improvement
>           Components: Rust - DataFusion
>             Reporter: Jorge
>
>
> This issue covers three main topics that IMO are addressed as a whole in a
> refactor of the physical plans and expressions in data fusion. The
> underlying issues that justify this particular ticket:
> h3. We currently assign poor names to the output schema.
>
> Specifically, most names are given based on the last expression's name.
> Example: {{SELECT c, SUM(a > 2), SUM(b) FROM t GROUP BY c}} yields the
> fields names "c, SUM, SUM".
> h3. We currently derive the column names from physical expressions, not
> logical expressions
>
> This implies that logical expressions that perform multiple operations
> (e.g. an grouped aggregation that performs partitioned aggregations + merge
> + final aggregation) have their name derived from their physical
> declaration, not logical. IMO a physical plan is an execution plan and is
> thus not concerned with naming. It is the logical plan that should be
> concerned with naming. Conceptually, a given logical plan can have more
> than one physical plan, e.g. depending on the execution environment (e.g.
> locally vs distributed).
> h3. We currently carry the index of a column read throughout the plans,
> making it cumbersome to write optimizers.
>
> More details here. In summary, it is possible to remove one of the
> optimizers and significantly simplify the other if columns do not carry
> indexing information.
> h2. Proposal
>
> I propose that we:
> h3. drop {{physical_plan::expressions::Column::index}}
>
> This is a major simplification of the code, and allow us to just ignore
> the position of the statement on the schema, and instead focus on its name.
> This is overall a simplification because it allow us to treat columns based
> solely on their names, and not on their position in the schema. Since SQL
> does not care about the position of the column on the table anyway (we
> currently already take the first column with that name), this seems natural.
>
> I already prototyped this [here|
> https://github.com/jorgecarleitao/arrow/tree/column_names].
>
> The main conclusion of this prototype is that this feasible as long as all
> our expressions get assigned a unique name, which is against what we
> currently offer (see example above). This leads me to:
> h3. drop {{physical_plan::PhysicalExpr::name()}}
>
> Currently, the name of an expression is derived from its physical plan.
> However, some operations' names are required to be known before its
> physical representation. The example I found in our current code is the
> grouped aggregation described above. If we were to build the name of our
> aggregation based on its physical plan, the name of a "COUNT(a)" operation
> would be {{SUM(COUNT(a))}} because, in the physical plan we first count on
> each partition, then merge, and them sum the counts over all partitions.
>
> Fundamentally, IMO the issue here is that we are mixing responsibilities:
> the physical plan should not care about naming, because the physical plan
> corresponds to an execution plan, not a logical description of the column
> (its name). This leads me to:
> h3. add {{logicalplan::Expr::name()}}
>
> This will contain the name of this expression, that will naturally depend
> on the variation. Its implementation will be based on our current code for
> {{physical_plan::PhysicalExpr::name()}}.
>
> I can take this work, but before committing, would like to know your
> thoughts about this. My initial prototyping indicate that all of this is
> possible and greatly simplifies the code, but I may be missing a design
> aspect of this.
>
>
>
> --
> This message was sent by Atlassian Jira
> (v8.3.4#803005)
>

Reply via email to