[
https://issues.apache.org/jira/browse/DRILL-3876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944191#comment-14944191
]
Jason Altekruse commented on DRILL-3876:
----------------------------------------
While there might be an argument to include projection in other operators, this
isn't actually the issue with flatten. When flatten was implemented, we created
several planning rules to work around overall limitations in project and the
expression materialization system, as well as enable nested flattens without
complicating the operator.
The code that needs to be fixed to remove the extra projection column is in
SplitUpComplexExpressions. This new rule was necessary to allow for a function
to pass a complex output into another function.
We don't have many functions that return complex outputs, but one possible use
of the function could be something like this:
select convert_to( kvgen( a_map), 'JSON') from dfs.`/table.json`
The output of kvgen is a repeated map, and convert_to(field, 'JSON') expects a
complex object as input. Drill does not currently support passing the complex
output from one expression (in the form of a FieldWriter) as a direct input to
another expression, at least not within a single project. The data must be
serialized to a vector and then a FieldReader must be created to feed data into
another function expecting the complex input.
To enable this functionality without enhancing project, the
SplitUpComplexExpressions rule was added to break up each complex expression
into its own project. The rule is currently inefficient: it assumes that an
extra copy of the incoming data may need to be kept around as input to a
different expression. For most of planning, flatten is treated like a complex
expression in a project. Right after SplitUpComplexExpressions runs, a
separate rule turns a project with a single flatten in it into a combination
of a project and a flatten operation.
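To make the splitting idea concrete, here is a rough toy model in Python. It
does not use any of Drill's planner classes; Col, Call, COMPLEX_FUNCS,
split_complex and the generated column names are all made up for illustration.
It only shows how a complex call nested inside another call could be hoisted
into its own lower project, and it ignores the pass-through columns that a real
rule would have to carry between projects.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical set of functions that produce complex (map/list) output.
COMPLEX_FUNCS = {"kvgen", "flatten"}


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Call:
    func: str
    args: Tuple["Expr", ...]


# A plain str stands in for a literal such as 'JSON'.
Expr = Union[Col, Call, str]


def split_complex(expr: Expr) -> List[List[Tuple[str, Expr]]]:
    """Return a chain of projects, bottom-most first.

    Each project is a list of (output column name, expression). Any complex
    call that feeds another call is moved into its own lower project and
    replaced by a reference to that project's output column.
    """
    lower: List[List[Tuple[str, Expr]]] = []
    counter = 0

    def rewrite(e: Expr, is_root: bool) -> Expr:
        nonlocal counter
        if not isinstance(e, Call):
            return e  # columns and literals are left alone
        # Rewrite arguments first so inner complex calls land in lower projects.
        rewritten = Call(e.func, tuple(rewrite(a, False) for a in e.args))
        if e.func in COMPLEX_FUNCS and not is_root:
            name = f"complex_{counter}"  # made-up intermediate column name
            counter += 1
            lower.append([(name, rewritten)])
            return Col(name)
        return rewritten

    top = rewrite(expr, True)
    return lower + [[("out", top)]]


# convert_to(kvgen(a_map), 'JSON') becomes two stacked projects.
stages = split_complex(
    Call("convert_to", (Call("kvgen", (Col("a_map"),)), "JSON"))
)
for depth, project in enumerate(stages):
    print(depth, project)
```

In this model the kvgen call ends up alone in the bottom project, and
convert_to reads its result as an ordinary column in the project above it.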
Example:
select flatten(a_list), a_list[0] from table;
Here the best thing to do would be to evaluate the indexing into the list
before flatten. Right now the rule just makes an extra copy of the original
list, assuming an evaluation like this might need to happen later. It does this
even when there are no other expressions in the project, which is a complete
waste. There are a couple of possible fixes. The simple one is for the rule to
do nothing when only a single expression is present. The right thing to do is
to enhance the rule to look for other usages of the input to a complex
expression among the other expressions in the project; if none are found, there
is no need for the extra copy of the data.
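The check described above could look roughly like the following Python sketch.
This is a toy model, not the actual rule: expressions are nested tuples, and
referenced_columns, needs_input_copy and the "item" function name are
hypothetical names chosen for the example.

```python
# Toy expression encoding (an assumption for this sketch):
#   ("col", name)            a column reference
#   ("lit", value)           a literal
#   ("call", func, *args)    a function call


def referenced_columns(expr):
    """Collect the names of all columns an expression reads."""
    kind = expr[0]
    if kind == "col":
        return {expr[1]}
    if kind == "call":
        cols = set()
        for arg in expr[2:]:
            cols |= referenced_columns(arg)
        return cols
    return set()  # literals read no columns


def needs_input_copy(project, index):
    """Return True if some other expression in the project reads a column
    that the complex expression at position `index` also consumes."""
    inputs = referenced_columns(project[index])
    for i, other in enumerate(project):
        if i != index and inputs & referenced_columns(other):
            return True
    return False


flatten_only = [("call", "flatten", ("col", "a_list"))]
flatten_and_index = [
    ("call", "flatten", ("col", "a_list")),
    ("call", "item", ("col", "a_list"), ("lit", 0)),  # stands in for a_list[0]
]

print(needs_input_copy(flatten_only, 0))
print(needs_input_copy(flatten_and_index, 0))
```

With a lone flatten(a_list) nothing else reads a_list, so no copy is needed;
with a_list[0] alongside it, the list is still needed by the other expression.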
You can see the desired "correct" behavior on a simple flatten by commenting
out the rule in DefaultSqlHandler, line 342. This makes a few of the tests that
rely on the rule fail, but it does make the basic case work.
> flatten() should not require a subsequent project to strip columns that
> aren't required
> ---------------------------------------------------------------------------------------
>
> Key: DRILL-3876
> URL: https://issues.apache.org/jira/browse/DRILL-3876
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Flow
> Affects Versions: 1.2.0
> Reporter: Chris Westin
> Assignee: Chris Westin
>