Tianyi Wang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8317 )
Change subject: IMPALA-5976: Remove equivalence class computation in FE ...................................................................... Patch Set 2: (46 comments) http://gerrit.cloudera.org:8080/#/c/8317/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8317/1//COMMIT_MSG@7 PS1, Line 7: IMPALA-5976: Remove equivalence class computation in FE > Remove equivalence classes from FE Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java File fe/src/main/java/org/apache/impala/analysis/Analyzer.java: http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@a2890 PS1, Line 2890: > This function was nice for debugging. Does your patch add a similar functio Added a function Graph.print, called when graph validation fails in testing. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1253 PS1, Line 1253: e.getIds(tids, null); > This new formatting is extremely hard to read, please reformat for readabil Reverted to the original. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1496 PS1, Line 1496: if (allDestSids.isEmpty()) continue; > Unless we are absolutely convinced that the change is correct, I think we s It is changed to checking whether src has value transfer to an an outer joined tuple, as discussed. The changed tests are also reverted. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1608 PS1, Line 1608: } > Refers to "equivalence class" again which is now an undefined term. If you Changed the definition of equivalence class at the top of this file. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1626 PS1, Line 1626: * to cover the known slot equivalences. This function should be called for join > What does this mean? Is the Integer key the SCC id? Yes. Added a comment. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1802 PS1, Line 1802: * Only contains equivalence classes with more than one member. > I've seen this check in a couple of places, would be nice if that could be The handling is not exactly the same among call sites. For example getValueTransferTargets() returns a list of a single element, while here it is skipped. I think the graph data structure shouldn't be aware that there could be out of range slot ids and should fail on out of range parameters instead of handling them. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1952 PS1, Line 1952: return false; > computeValueTransferGraph Done. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@1972 PS1, Line 1972: String tc = reference.print(); > valueTransfers (plural) Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@2047 PS1, Line 2047: || tblRef.getJoinOp() == JoinOperator.LEFT_ANTI_JOIN > Comment need updating. The concept of "equivalence class" does not exist an Equivalence class is now defined at the top of this file. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@2055 PS1, Line 2055: } > In terms of the g API I think callers should have to care about the existen Removed the getSccId -> getSccMembers indirection. Do you suggest removing the SCC terminology from the graph interfaces? http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Analyzer.java@2070 PS1, Line 2070: } > Shouldn't the destinations already be sorted? Not in current implementation since CondensedTransitiveClosure.getDests() assembles its result on the fly and the result is not sorted there. I haven't figured out why the call site of this function rely on the ordering. The current state is that removing this sort fails some tests (Not different plan. Some runtime filters disappeared.) http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Expr.java File fe/src/main/java/org/apache/impala/analysis/Expr.java: http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/analysis/Expr.java@702 PS1, Line 702: if (this instanceof CastExpr && ((CastExpr)this).isImplicit()) { > It's very unfortunate that we have to create a new pair object to do a Slot Done. An interface ExprBinaryPredicate is added. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java File fe/src/main/java/org/apache/impala/util/Graph.java: http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@1 PS1, Line 1: // Licensed to the Apache Software Foundation (ASF) under one > Apache header Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@18 PS1, Line 18: package org.apache.impala.util; > Comment on the ordering of elements (they are unordered correct?) It was unordered. Since in the new patch binary search is used, it diverges into ordered and unordered types. Comments are added accordingly. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@19 PS1, Line 19: > I feel like this should really be int[][] for maximum efficiency. It genera Implemented util.IntArrayList. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@23 PS1, Line 23: > style: ++i Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@36 PS1, Line 36: */ > Not sure how useful these Preconditions checks are. We'll even get a more m Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@45 PS1, Line 45: BitSet visited = new BitSet(numVertices()); > Does this really need to be public? No but should the visibility be as strict as possible? http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@46 PS1, Line 46: for (int srcVid = 0; srcVid < numVertices(); ++srcVid) { > Don't think this is needed. Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@73 PS1, Line 73: sb.append(i).append(" => "); > I'm worried this might blow the stack for huge graphs - this would be a reg Reimplemented in BFS. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@86 PS1, Line 86: public static boolean validate(Graph a, Graph b) { > Any idea how this compares to the previous transitive closure implementatio In general this is O(V(V+E)) while the old one is O(V^3). The old one might be faster on a really dense graph because of the locality of looping/matrix representation. For sparse graphs this should be faster. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@89 PS1, Line 89: if (!Ordering.natural().sortedCopy(a.dsts(i)).equals( > Create outside of loop and clear() within loop to avoid generating objects Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@107 PS1, Line 107: } > remove "belonging" the "its" already implies that Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@109 PS1, Line 109: @Override > (a list of original vertex ids) Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@110 PS1, Line 110: public int numVertices() { > Can this be an int[][] for the same reasons stated above? Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@114 PS1, Line 114: @Override > Does the HashSet really buy us much versus doing a binary search on the sor The newest patch uses binary search. I did some experiments. At 100,000 scale HastSet<Integer> is 1.6X faster than binary search but uses 13X memory. If neither is satisfying we may consider primitive int hastset as well. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@134 PS1, Line 134: private RandomAccessibleGraph(IntArrayList[] adjList) { > Remove these Preconditions checks - they don't seem useful to me. Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@144 PS1, Line 144: public IntArrayList dsts(int srcVid) { > Remove Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@156 PS1, Line 156: dstVid); > Remove Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@180 PS1, Line 180: > Please avoid chaining multiple non-trivial function calls because that make Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@195 PS1, Line 195: return result; > ... using Tarjan's algorithm. Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@201 PS1, Line 201: */ > sccIds Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@204 PS1, Line 204: } > Needs a better variable name. Not clear what "index" means; "indexes" is be It is the order in which vertices are visited. It's called "NUMBER" in Tarjan's original paper. "DFS preordering index" should be more appropriate. The comments are updated and the variable is renamed into "indexes". http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@205 PS1, Line 205: > lowLinks Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@209 PS1, Line 209: */ > Needs a better var name, what is this counting? Is this something like visi Renamed into dfsOrdinal. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@223 PS1, Line 223: SccCondensedGraph condensed = fromGraph(g); > I think we should use vertexId or vid or similar consistently. All the variable naming is changed to .*vid. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@227 PS1, Line 227: > Brief description should also be above L205 Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@237 PS1, Line 237: * Get an array of vertices ids in the scc. The caller shouldn't modify the returned > int[] sccId Done. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@239 PS1, Line 239: * Time complexity: O(1) > index[vertex] = lowLink[vertex] = indexCounter; Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@243 PS1, Line 243: } > A few simple comments here and there would make this code easier to follow. Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@244 PS1, Line 244: > I'm worried that recursion might blow the stack and lead to a regression in Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@252 PS1, Line 252: } > Brief comment about this case/code, e.g. "Create an SCC from all elements o Done http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@253 PS1, Line 253: > use while() The for loop no longer exists. http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@265 PS1, Line 265: * @param g The input graph. > Project the original graph 'g' to a new graph in SCC space. Done. (The term "project" is removed since it basically means the same thing as "condense"). http://gerrit.cloudera.org:8080/#/c/8317/1/fe/src/main/java/org/apache/impala/util/Graph.java@274 PS1, Line 274: int[] indexes = new int[g.numVertices()]; > Instead of having a separate normalize(), can't you remove duplicates right Done -- To view, visit http://gerrit.cloudera.org:8080/8317 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If4cb1d8be46efa8fd61a97048cc79dabe2ffa51a Gerrit-Change-Number: 8317 Gerrit-PatchSet: 2 Gerrit-Owner: Tianyi Wang <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Tianyi Wang <[email protected]> Gerrit-Comment-Date: Wed, 25 Oct 2017 01:43:48 +0000 Gerrit-HasComments: Yes
