[Impala-ASF-CR] IMPALA-2521: Add clustered hint to insert statements

Alex Behm (Code Review) Tue, 18 Oct 2016 16:01:43 -0700

Alex Behm has posted comments on this change.

Change subject: IMPALA-2521: Add clustered hint to insert statements
......................................................................

Patch Set 2:

(21 comments)

http://gerrit.cloudera.org:8080/#/c/4745/2//COMMIT_MSG
Commit Message:

Line 11: distributed plan, just before the table sink. This has the effect that
-distributed (also adds to the single-node plan)

http://gerrit.cloudera.org:8080/#/c/4745/2/fe/src/main/java/org/apache/impala/analysis/QueryStmt.java
File fe/src/main/java/org/apache/impala/analysis/QueryStmt.java:

Line 238: * TODO: We could do something more sophisticated than simply
copying input
move this TODO into SortInfo.createSortTupleInfo()

Line 253: Set<SlotRef> sourceSlots = Sets.newHashSet();
these few lines can also be moved into createSortTupleInfo() right?

http://gerrit.cloudera.org:8080/#/c/4745/2/fe/src/main/java/org/apache/impala/analysis/SortInfo.java
File fe/src/main/java/org/apache/impala/analysis/SortInfo.java:

Line 125: * the ordering exprs refer to the materialized tuples instead of
the original input.
refer to the new tuple materialized by this sort instead of the original input.

Line 145: * output by the exec node implementing the sort. Done by
materializing slot refs in the
shorter: by the sort node

Line 146: * order-by and result expressions. Those slot refs in the ordering
and result exprs are
comment about result expressions only makes sense if we pass in the result
expressions to this function, otherwise it's not really clear where those come
from

Line 147: * substituted with slot refs into the new tuple. This simplifies
sorting logic for
we also do this for top-n

Line 152: public void createSortTupleInfo(Set<SlotRef> sourceSlots, boolean
materializeSlots, StatementBase stmt, Analyzer analyzer) {
I'd suggest removing the 'stmt' parameter and instead returning the
ExprEubstitutionMap. Then the stmt callers can apply that smap to whatever
exprs they need to change.

Line 153: // The tuple descriptor for the sort input. It will contain the
materialized tuples,
Isn't this the tuple descriptor for the sort output? The secnd sentence is a
little imprecise because ("it contains") because a tuple descriptor does not
contain materialized tuples.

Line 163: // the internal rows of the sort node.
builds up the tuple operated on and returned by sort node

Line 171: // TODO: I copied this from QueryStmt.java and would like to
add an explanatory
I think the else case may not be needed/possible, but the explanation is that
it's ok to move predicates outside of this query block, but it's not ok to move
predicates in if there is a limit. Consider this query:

select count(*) from t1 inner join
(select id, name from t2
where id < 10
order by limit 10) v
on (t1.id = v.id and t1.name = v.name)
where t1.name = 'test'

In the example above it's ok to generate 't1.id < 10' and put it into the scan
of t1, but it's not ok to generate 't2.name = 'test'' and move it into the scan
of t2.

Hope this makes sense.

Line 173: if (stmt.hasLimit()) {
Can we also move this to the caller? The returned smap can be used to create
this. The reason is that it doesn't really make sense to add these value
transfers during plan generation (for the clustered hint use case).

Line 181: if (materializeSlots) {
let's let the caller make this decision without passing a flag

http://gerrit.cloudera.org:8080/#/c/4745/2/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

Line 149: // Add optional clustering node to the plan, based on
clustered/noclustered plan
Let's keep it concrete: we're adding a sort node

Line 497: * Insert a sort node into the plan, depending on the
clustered/noclustered plan hint.
Mention the ordering of the sort

Line 498: * This will re-order the data produced by 'inputFragment' in a way
that partitions can
Second sentence is kind of vague. I understand that sorting is just an instance
of a more general clustering concept, but let's stick to what we are actually
doing here :)

Line 503: if (!insertStmt.hasClusteredHint()) {
single line (and elsewhere)

Line 508: // Ignore constants for the sake of partitioning.
partitioning?

http://gerrit.cloudera.org:8080/#/c/4745/2/testdata/workloads/functional-planner/queries/PlannerTest/insert.test
File testdata/workloads/functional-planner/queries/PlannerTest/insert.test:

Line 573: # IMPALA-2521: Test to make sure that clustering adds a sort node
into the plan.
Let's test these combinations:
1. Partitioned tables
1.1. With clustered hint
1.2. With clustered and noshuffle hint
2. Unpartitioned tables
2.1. With clustered hint
2.2. With clustered and shuffle hint

Line 575: select * from functional.alltypes
Let's also test Kudu tables because there we want to sort on the PKs. Let's add
those tests to a Kudu-specific .test file

Line 576: ---- DISTRIBUTEDPLAN
also add ---- PLAN section

--
To view, visit http://gerrit.cloudera.org:8080/4745
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I412153bd8435d792bd61dea268d7a3b884048f14
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2521: Add clustered hint to insert statements

Reply via email to