Alex Behm has posted comments on this change. Change subject: IMPALA-2521: Add clustered hint to insert statements ......................................................................
Patch Set 2: (21 comments) http://gerrit.cloudera.org:8080/#/c/4745/2//COMMIT_MSG Commit Message: Line 11: distributed plan, just before the table sink. This has the effect that -distributed (also adds to the single-node plan) http://gerrit.cloudera.org:8080/#/c/4745/2/fe/src/main/java/org/apache/impala/analysis/QueryStmt.java File fe/src/main/java/org/apache/impala/analysis/QueryStmt.java: Line 238: * TODO: We could do something more sophisticated than simply copying input move this TODO into SortInfo.createSortTupleInfo() Line 253: Set<SlotRef> sourceSlots = Sets.newHashSet(); these few lines can also be moved into createSortTupleInfo() right? http://gerrit.cloudera.org:8080/#/c/4745/2/fe/src/main/java/org/apache/impala/analysis/SortInfo.java File fe/src/main/java/org/apache/impala/analysis/SortInfo.java: Line 125: * the ordering exprs refer to the materialized tuples instead of the original input. refer to the new tuple materialized by this sort instead of the original input. Line 145: * output by the exec node implementing the sort. Done by materializing slot refs in the shorter: by the sort node Line 146: * order-by and result expressions. Those slot refs in the ordering and result exprs are comment about result expressions only makes sense if we pass in the result expressions to this function, otherwise it's not really clear where those come from Line 147: * substituted with slot refs into the new tuple. This simplifies sorting logic for we also do this for top-n Line 152: public void createSortTupleInfo(Set<SlotRef> sourceSlots, boolean materializeSlots, StatementBase stmt, Analyzer analyzer) { I'd suggest removing the 'stmt' parameter and instead returning the ExprEubstitutionMap. Then the stmt callers can apply that smap to whatever exprs they need to change. Line 153: // The tuple descriptor for the sort input. It will contain the materialized tuples, Isn't this the tuple descriptor for the sort output? The secnd sentence is a little imprecise because ("it contains") because a tuple descriptor does not contain materialized tuples. Line 163: // the internal rows of the sort node. builds up the tuple operated on and returned by sort node Line 171: // TODO: I copied this from QueryStmt.java and would like to add an explanatory I think the else case may not be needed/possible, but the explanation is that it's ok to move predicates outside of this query block, but it's not ok to move predicates in if there is a limit. Consider this query: select count(*) from t1 inner join (select id, name from t2 where id < 10 order by limit 10) v on (t1.id = v.id and t1.name = v.name) where t1.name = 'test' In the example above it's ok to generate 't1.id < 10' and put it into the scan of t1, but it's not ok to generate 't2.name = 'test'' and move it into the scan of t2. Hope this makes sense. Line 173: if (stmt.hasLimit()) { Can we also move this to the caller? The returned smap can be used to create this. The reason is that it doesn't really make sense to add these value transfers during plan generation (for the clustered hint use case). Line 181: if (materializeSlots) { let's let the caller make this decision without passing a flag http://gerrit.cloudera.org:8080/#/c/4745/2/fe/src/main/java/org/apache/impala/planner/Planner.java File fe/src/main/java/org/apache/impala/planner/Planner.java: Line 149: // Add optional clustering node to the plan, based on clustered/noclustered plan Let's keep it concrete: we're adding a sort node Line 497: * Insert a sort node into the plan, depending on the clustered/noclustered plan hint. Mention the ordering of the sort Line 498: * This will re-order the data produced by 'inputFragment' in a way that partitions can Second sentence is kind of vague. I understand that sorting is just an instance of a more general clustering concept, but let's stick to what we are actually doing here :) Line 503: if (!insertStmt.hasClusteredHint()) { single line (and elsewhere) Line 508: // Ignore constants for the sake of partitioning. partitioning? http://gerrit.cloudera.org:8080/#/c/4745/2/testdata/workloads/functional-planner/queries/PlannerTest/insert.test File testdata/workloads/functional-planner/queries/PlannerTest/insert.test: Line 573: # IMPALA-2521: Test to make sure that clustering adds a sort node into the plan. Let's test these combinations: 1. Partitioned tables 1.1. With clustered hint 1.2. With clustered and noshuffle hint 2. Unpartitioned tables 2.1. With clustered hint 2.2. With clustered and shuffle hint Line 575: select * from functional.alltypes Let's also test Kudu tables because there we want to sort on the PKs. Let's add those tests to a Kudu-specific .test file Line 576: ---- DISTRIBUTEDPLAN also add ---- PLAN section -- To view, visit http://gerrit.cloudera.org:8080/4745 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I412153bd8435d792bd61dea268d7a3b884048f14 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-HasComments: Yes
