Wenhai Li has posted comments on this change. Change subject: Applied the multiway fuzzyjoin based on the prefix-based join and the selectFuzzyJoin testCases. ......................................................................
Patch Set 10: (15 comments) @Taewoo, Sorry, I did not know you cann't see the comments without publishing. :) OK, published. Maybe we can talk with some detail with running example. You know, for those specified detail, these pessimistic methods can always forbid the exception of the dynamic variables generation. https://asterix-gerrit.ics.uci.edu/#/c/1076/10/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java File asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java: Line 83: // Step1: Initial the host embedding aql to substitute the fuzzy euqal condition such as $r.a ~= $s.b > What does "Initial the host embedding aql" mean? This is grammatical mistake. Here, initial = initialize. This is a substituted template to ingest the leaves of the fuzzyjoin and translate them into the equal fuzzyjoin subplan. It means that the following template AQLPLUS will be embedded regardless the token functions. Line 88: + "((#LEFT_%d_0), " + " (join((#RIGHT_%d_0), " > %d change is necessary because? I our old version, we always duplicate the vars with a LogicalOperatorDeepCopyVaraiblesVisitor other than the newest LogicalOperatorDeepCopyWithNewVaraiblesVisitor, we will get duplicated vars if not differentiate the vars like this. Now, I "have removed" this var in the version 12 of this patch. Even by this, the following remarks are also useful for the function of this variable. As will be replaced by the nBranches, this %d formatter is used to differentiate the VarIdentifier (line 311-338 in Step 3.1-3.3) in the multi-fuzzyjoin context, where additional leaf vars will be introduced into the translator. We need to different the leaf identifier once this rule is triggered multiple times in the multi-fuzzyjoin. Otherwise, we will get the same var identifier for the different fuzzyjoin branches. For example, for $r in dataset DBLP for $s in dataset CSX\ for $t in dataset ACM where word-tokens($r.title) ~= word-tokens($s.title) and word-tokens($r.title) ~= word-tokens($t.title) return {"rid":$r.id, "sid":$s.id, "tid":$t.id} , we will get a plan abstracted as (word-token($r.title) fj word-tokens($s.title)) => (word-token($r.title) fj word-tokens($t.title) . The var substitution will make the three identical leaves different. Line 201: VariableUtilities.getLiveVariables(rightInputOp, liveVars); > What do we want to achieve for this change (line 201 - 220)? Can you put mo This is a segment especially for the SELECT based on fuzzy join on the same table pair. Let take an example for this: for $d in dataset DBLP for $t in dataset CSX where word-tokens($d.title) ~= word-tokens($t.title) and word-tokens($d.authors) ~= word-tokens($t.authors) return {"did": $d.id, "tid": $t.id} Actually, these two fuzzyjoin are placed on the same table pair. Of course we can fire two fuzzyjoins consequently. Nevertheless, a cheaper method is to setup a fuzzyjoin on the first condition and conduct a SELECT towards its result. Is well known that the fuzzy join will generally propagate the input cardinalities of the join branches. Here, pk are appended onto the static previousPK list after each round of fuzzyjoin are ingested. We can differentiate whether a same table-pair has been translated. Line 250: if (constVal.getObject() instanceof AFloat) { > instanceOf can be replaced by using constVal.getObject() and BuiltinType. Done Line 253: simThreshold = FuzzyUtils.getSimThreshold(metadataProvider); > Do we have this case? By default, without user input? Line 342: for (int i = 1; i < 4; i++) { > Why this is fixed number? i < 4? right side will be used as the token order source, right record order and surrogate join. Please refer to the template AQLPLUS. Line 378: // Step 5. Bind the plan to the father op referred by the below opRef. > Can we put an example here? "the father op referred by the below opRef." is Also refer an example. Here, suppose the fuzzyjoin are word-tokens(DBLP.title) fj word-tokens(CSX.title) return {DBLP.id, CSX.id}. The father means the operator depending on fj, i.e. return {DBLP.id, CSX.id}. In general, this "parent" is the upstream operator that referred to fj in the original plan. Line 408: private void setConditionForLeftOuterJoin(LeftOuterJoinOperator topJoin, Mutable<ILogicalExpression> expRef) { > In the above, can we put some comments? What does this function do? Done Line 441: Mutable<ILogicalExpression> expRefRet = getSimilarityExpression(arg); > Is this change necessary? It is up to the reviewer. :) Line 451: private List<LogicalVariable> findPrimaryKeysInSubplan(Collection<LogicalVariable> liveVars, > In the above, can we put some comments? What does this function do? Done https://asterix-gerrit.ics.uci.edu/#/c/1076/10/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/subplan/InlineSubplanInputForNestedTupleSourceRule.java File asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/subplan/InlineSubplanInputForNestedTupleSourceRule.java: Line 256: public class InlineSubplanInputForNestedTupleSourceRule implements IAlgebraicRewriteRule { > Why the "hasRun" boolean needs to be removed? Can you explain this to me? Specially, in the multi-fuzzyjoin cases, we have common cases that contain multiple nestedtuplesource along a fuzzyjoin branch. We want to flatten all those nestedtuplesource operator and translate them into a series of subplans. This is a real problem since I can only get this case in the star-fuzzyjoin. Maybe we can set up a talk for this. https://asterix-gerrit.ics.uci.edu/#/c/1076/10/asterixdb/asterix-app/src/main/java/org/apache/asterix/api/common/APIFramework.java File asterixdb/asterix-app/src/main/java/org/apache/asterix/api/common/APIFramework.java: Line 313: QueryLogicalExpressionJobGen.INSTANCE)); > Not necessary change - can you check your formatter? Done https://asterix-gerrit.ics.uci.edu/#/c/1076/10/asterixdb/asterix-app/src/test/resources/optimizerts/queries/fj-dblp-csx-hybrid.aql File asterixdb/asterix-app/src/test/resources/optimizerts/queries/fj-dblp-csx-hybrid.aql: Line 18: */ > Can you a description of this test? What it is intended for? And the expect Done https://asterix-gerrit.ics.uci.edu/#/c/1076/10/hyracks-fullstack/algebricks/algebricks-core/src/main/java/org/apache/hyracks/algebricks/core/algebra/operators/logical/visitors/IsomorphismUtilities.java File hyracks-fullstack/algebricks/algebricks-core/src/main/java/org/apache/hyracks/algebricks/core/algebra/operators/logical/visitors/IsomorphismUtilities.java: Line 81: private static ILogicalOperator extractPKProduction(ILogicalOperator root, LogicalVariable pk) > Please add some comments in the above - what does this function do? Done Line 105: public static void mergeHomogeneousPK(ILogicalOperator op, List<LogicalVariable> pks) throws AlgebricksException { > Please add some comments in the above - what does this function do? Done -- To view, visit https://asterix-gerrit.ics.uci.edu/1076 To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8736f104905eeda763d39709e002c2b9629278cc Gerrit-PatchSet: 10 Gerrit-Project: asterixdb Gerrit-Branch: master Gerrit-Owner: Wenhai Li <[email protected]> Gerrit-Reviewer: Chen Li <[email protected]> Gerrit-Reviewer: Jenkins <[email protected]> Gerrit-Reviewer: Taewoo Kim <[email protected]> Gerrit-Reviewer: Wenhai Li <[email protected]> Gerrit-HasComments: Yes
