Alex Behm has posted comments on this change. Change subject: IMPALA-3586: Implement union passthrough ......................................................................
Patch Set 7: (40 comments) http://gerrit.cloudera.org:8080/#/c/5816/7/be/src/exec/analytic-eval-node.cc File be/src/exec/analytic-eval-node.cc: Line 145: DCHECK(child(0)->row_desc().IsPrefixOfEquivalent(row_desc())); This should be IsPrefixOf() because we sanity checking the row descriptors of exec nodes (and not row batches). http://gerrit.cloudera.org:8080/#/c/5816/7/be/src/exec/union-node.cc File be/src/exec/union-node.cc: Line 115: // passthrough case, the child can be closed right away because the row batch received the child can be closed right away -> the child was already closed? Line 116: // from the child is copied (more details below). accuracy: the row batch wasn't copied Line 121: if (child_idx_ < children_.size() && isChildPassThrough(child_idx_)) { Suggest comment: // Handle passthrough children. We pass 'row_batch' directly into the GetNext() call on the child. Line 122: // If the rows from the current child can be passed through, the physical row layout This comment doesn't seem to add anything, I suggest removing it. Line 131: // It may be possible that the row batch is not empty, so we save the previous value. More details would be helpful. If the batch has rows at this point, I'd imagine it can cause all sorts of other problems. How can the batch already have rows? Line 148: // 'needs_deep_copy' lets us safely close the child in the next GetNext() call. style: 'needs_deep_copy' is not a visible variable here, suggest just saying "Marking the batch as needing a deep copy let's us ... Line 154: DCHECK that child_idx_ is not passthrough here http://gerrit.cloudera.org:8080/#/c/5816/7/be/src/exec/union-node.h File be/src/exec/union-node.h: Line 67: // Which children can be passed through, without being materialized. ... without evaluating and materializing their exprs. http://gerrit.cloudera.org:8080/#/c/5816/7/be/src/runtime/descriptors.h File be/src/runtime/descriptors.h: Line 412: /// Return true if the physical layout of this descriptor matches the physical layout matches that of other_desc Line 413: /// of other_desc, but not necessarily ids. bot not necessarily the id. Line 565: /// of other_desc, but not necessarily ids. but not necessarily the ids http://gerrit.cloudera.org:8080/#/c/5816/7/be/src/service/query-options.h File be/src/service/query-options.h: Line 38: TImpalaQueryOptions::DISABLE_UNION_PASSTHROUGH + 1);\ I tend to prefer ENABLE_UNION_PASSTHROUGH. To me positive phrasing is a little easier to understand. What do you think? http://gerrit.cloudera.org:8080/#/c/5816/7/fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java File fe/src/main/java/org/apache/impala/analysis/SlotDescriptor.java: Line 227: public boolean isEquivalent(SlotDescriptor other) { Unfortunately, the term 'equivalent' already has a different meaning in the FE for slots, so it would be good to the existing term fro this new one. Maybe isLayoutEquivalent()? http://gerrit.cloudera.org:8080/#/c/5816/7/fe/src/main/java/org/apache/impala/analysis/UnionStmt.java File fe/src/main/java/org/apache/impala/analysis/UnionStmt.java: Line 616: public List<Expr> getUnionNodeResultExprs() { getUnionResultExprs() http://gerrit.cloudera.org:8080/#/c/5816/7/fe/src/main/java/org/apache/impala/planner/UnionNode.java File fe/src/main/java/org/apache/impala/planner/UnionNode.java: Line 47: * a child has an identical tuple layout as the output of the union node, the ... as the output of the union node, and the child only has naked SlotRefs as result exprs, then the child is marked as 'passthrough'. The rows of passthrough children are directly returned by the union node, instead of materializing the child's result exprs into new tuples. Line 57: protected final List<Expr> resultExprs_; unionResultExprs_ to make distinguish it from the resultExprLists_ and such Line 73: // If false, no batches from child nodes would be passed through in the backend. Comment should describe what this flag is. Also you mean "true" right? Line 76: // Indicates for which child nodes batches can be passed through in the backend. remove "in the backend" (it's clear that execution happens there) Line 81: protected UnionNode(PlanNodeId id, TupleId tupleId) { Is this c'tor still needed? If not, remove. Line 89: List<Expr> resultExprs, boolean isInSubplan) { indentation, unionResultExprs Line 182: * Compute whether we can pass through rows without materializing for the given child. Can combine/simplify like this: Returns true if rows from the child with 'childTupleIds' and 'childResultExprs' can be returned directly by the union node (without materialization into a new tuple). Might be good to list the conditions for passthrough compatibility. Line 189: Analyzer analyzer, List<TupleId> childTupleIds, List<Expr> childExprList) { childExprList -> childResultExprs Line 190: if (analyzer.getQueryOptions().isDisable_union_passthrough()) return false; seems clearer to move this into init() and not execute any of the passthrough code Line 193: // TODO: Remove this as part of IMPALA-4179. Move TODO to the class comment Line 194: if (isInSubplan_) return false; same here, seems easier to move this check into init() Line 205: // Verify that the union node has one slot for every expression. union node -> union tuple descriptor Line 212: if (resultExprs_.size() != childTupleDescriptor.getSlots().size()) return false; I don't think this tricky check is correct because it won't allow passthrough for something like: select int_col, int_col, int_col from functional.alltypes union all select int_col, int_col, int_col from functional.alltypes Line 218: SlotRef unionSlot = resultExprs_.get(i).unwrapSlotRef(false); unionSlotRef, childSlotRef Line 220: if (!unionTupleDescriptor.getSlots().get(i).isMaterialized()) continue; move above the unwrapSlotRef() calls Line 221: Preconditions.checkState(unionSlot.getDesc().getParent().getId().equals(tupleId_)); Don't think we need this check, but something like this would be good: Preconditions.checkStateNotNull(unionSlotRef); Line 223: Preconditions.checkState(!(childSlot instanceof SlotRef)); No need for this check Line 262: // Compute which nodes can be passed through. which child nodes Line 266: // Check that if the child outputs a single tuple, then it's not nullable. Tuple move into computePassThrough http://gerrit.cloudera.org:8080/#/c/5816/7/testdata/workloads/functional-planner/queries/PlannerTest/kudu.test File testdata/workloads/functional-planner/queries/PlannerTest/kudu.test: Line 329: # IMPALA-3586: The operand with the Kudu scan cannot be passed through because it's not because id is not-nullable (primary key) Line 346: select id from functional_kudu.alltypes do select * http://gerrit.cloudera.org:8080/#/c/5816/7/testdata/workloads/functional-planner/queries/PlannerTest/union.test File testdata/workloads/functional-planner/queries/PlannerTest/union.test: Line 3104: # IMPALA-3678: Both union operands should produce rows with non-nullable slots which remove "should" Line 3124: # IMPALA-3678: The Union operands that contain a join should not be passed through, nice Line 3184: select COUNT(c.c_custkey), COUNT(v.tot_price) lowercase count for consistency http://gerrit.cloudera.org:8080/#/c/5816/7/testdata/workloads/functional-query/queries/QueryTest/union.test File testdata/workloads/functional-query/queries/QueryTest/union.test: Line 1126: select COUNT(c.c_custkey), COUNT(v.tot_price) lowercase count for consistency -- To view, visit http://gerrit.cloudera.org:8080/5816 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia8f6d5062724ba5b78174c3227a7a796d10d8416 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-HasComments: Yes