Thanks for the detailed write up. That all makes good sense to me. I am not sure that I had a good reason for having physical expressions determine their names.
On Fri, Jul 17, 2020, 12:50 PM Jorge (Jira) <j...@apache.org> wrote: > Jorge created ARROW-9516: > ---------------------------- > > Summary: [Rust][DataFusion] Refactor physical expressions to > not care about their names nor indexes > Key: ARROW-9516 > URL: https://issues.apache.org/jira/browse/ARROW-9516 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust - DataFusion > Reporter: Jorge > > > This issue covers three main topics that IMO are addressed as a whole in a > refactor of the physical plans and expressions in data fusion. The > underlying issues that justify this particular ticket: > h3. We currently assign poor names to the output schema. > > Specifically, most names are given based on the last expression's name. > Example: {{SELECT c, SUM(a > 2), SUM(b) FROM t GROUP BY c}} yields the > fields names "c, SUM, SUM". > h3. We currently derive the column names from physical expressions, not > logical expressions > > This implies that logical expressions that perform multiple operations > (e.g. an grouped aggregation that performs partitioned aggregations + merge > + final aggregation) have their name derived from their physical > declaration, not logical. IMO a physical plan is an execution plan and is > thus not concerned with naming. It is the logical plan that should be > concerned with naming. Conceptually, a given logical plan can have more > than one physical plan, e.g. depending on the execution environment (e.g. > locally vs distributed). > h3. We currently carry the index of a column read throughout the plans, > making it cumbersome to write optimizers. > > More details here. In summary, it is possible to remove one of the > optimizers and significantly simplify the other if columns do not carry > indexing information. > h2. Proposal > > I propose that we: > h3. drop {{physical_plan::expressions::Column::index}} > > This is a major simplification of the code, and allow us to just ignore > the position of the statement on the schema, and instead focus on its name. > This is overall a simplification because it allow us to treat columns based > solely on their names, and not on their position in the schema. Since SQL > does not care about the position of the column on the table anyway (we > currently already take the first column with that name), this seems natural. > > I already prototyped this [here| > https://github.com/jorgecarleitao/arrow/tree/column_names]. > > The main conclusion of this prototype is that this feasible as long as all > our expressions get assigned a unique name, which is against what we > currently offer (see example above). This leads me to: > h3. drop {{physical_plan::PhysicalExpr::name()}} > > Currently, the name of an expression is derived from its physical plan. > However, some operations' names are required to be known before its > physical representation. The example I found in our current code is the > grouped aggregation described above. If we were to build the name of our > aggregation based on its physical plan, the name of a "COUNT(a)" operation > would be {{SUM(COUNT(a))}} because, in the physical plan we first count on > each partition, then merge, and them sum the counts over all partitions. > > Fundamentally, IMO the issue here is that we are mixing responsibilities: > the physical plan should not care about naming, because the physical plan > corresponds to an execution plan, not a logical description of the column > (its name). This leads me to: > h3. add {{logicalplan::Expr::name()}} > > This will contain the name of this expression, that will naturally depend > on the variation. Its implementation will be based on our current code for > {{physical_plan::PhysicalExpr::name()}}. > > I can take this work, but before committing, would like to know your > thoughts about this. My initial prototyping indicate that all of this is > possible and greatly simplifies the code, but I may be missing a design > aspect of this. > > > > -- > This message was sent by Atlassian Jira > (v8.3.4#803005) >