Paul Rogers created DRILL-7455:
----------------------------------

             Summary: "Renaming" projection operator to avoid physical copies
                 Key: DRILL-7455
                 URL: https://issues.apache.org/jira/browse/DRILL-7455
             Project: Apache Drill
          Issue Type: Improvement
            Reporter: Paul Rogers


Drill/Calcite inserts project operators for three main reasons:

1. To compute a new column: {{SELECT a + b AS c ...}}

2. To rename columns: {{SELECT a AS x ...}}

3. To remove columns: {{SELECT a ...} but a data source provides columns {{a}}, 
and {{b}}.

Example of case 1:

{code:json}
    "pop" : "project",
    "@id" : 4,
    "exprs" : [ {
      "ref" : "`a0`",
      "expr" : "`a`"
    }, {
      "ref" : "`b0`",
      "expr" : "`b`"
    } ],
{code}

Of these, only case 2 requires row-by-row computation of new values. Case 1 
simply creates a new vector with only the name changed; but the same data. Case 
3 preserves some vectors, drops others.

In the cases 1 and 2, a simple data transfer from input to output would be 
adequate. Yet, if one steps through the code, and enables code generation, one 
will see that Drill steps through each record in all three cases, even calling 
an empty per-record compute block.

A better-performance solution is to separate out the renames/drops (cases 1 and 
3) from the column computations (case 2). This can be done either:

1. At plan time, identify that all columns are renames, and replace the 
row-by-row project with a column-level project.

2. At run time that identifies the column-level projections (cases 1 and 3) and 
handles those with transfer pairs, while doing row-by-row computes only if case 
2 exists.

Since row-by-row copies are among the most expensive operations in Drill, this 
optimization could improve performance by a decent amount.

Note that a further optimization is to remove "trivial" projects such as the 
following:

{code:json}
    "pop" : "project",
    "@id" : 2,
    "exprs" : [ {
      "ref" : "`a`",
      "expr" : "`a`"
    }, {
      "ref" : "`b`",
      "expr" : "`b`"
    }, {
      "ref" : "`b0`",
      "expr" : "`b0`"
    } ],
{code}

The only value of such a projection is to say, "remove all vectors except 
{{a}}, {{b}} and {{b0}}. In fact, the only time such a projection should be 
needed is:

1. On top of a data source that does not support projection push down.

2. When Calcite knows it wants to discard certain intermediate columns.

Otherwise, Calcite knows which columns emerge from operator x, and should not 
need to add a project to enforce that schema if it is already what the project 
will emit.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to