Wenhai Li has posted comments on this change.

Change subject: Applied the multiway fuzzyjoin based on the prefix-based join and the selectFuzzyJoin testCases.
......................................................................

Patch Set 21:

(10 comments)

@Taewoo, sorry, I did not know that you can't see the comments until they are published. :) OK, they are published now. Maybe we can go through some of the details with a running example. For the specific details below, these pessimistic checks can always prevent exceptions during the generation of the dynamic variables.

https://asterix-gerrit.ics.uci.edu/#/c/1076/21/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java
File asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java:

Line 197: // To handle multiple fuzzyjoin conditions on the same table pair, this rule differentiate the PKs
> What do we mean by [the same table pair] here?

Query example:

use dataverse fuzzytest;
for $d in dataset DBLP
for $t in dataset CSX
where word-tokens($d.title) ~= word-tokens($t.title)
  and word-tokens($d.authors) ~= word-tokens($t.authors)
return {"did": $d.tid, "tid": $t.tid}

Explanation in general:
The first round has the following functional dependencies:
1. $d.title -> $d.tid
2. $t.title -> $t.tid
which means that $d.title is derived from table $d and $t.title is derived from table $t, respectively.
After that iteration, in the second round, we have the following functional dependencies:
1. $d.authors -> $d.tid
2. $t.authors -> $t.tid
Both right-hand sides have already been kept in previousPK.

Result: In this context, we have two fuzzy join conditions on the same table pair, and the second fuzzy join SHOULD be interpreted as a fuzzy select over the result of the first fuzzy join.

Handling strategy: Skip the second fuzzy join rather than treating it as another fuzzy join, to avoid a wrong substitution based on the fixed template. (We have already substituted the two table branches in the first fuzzy join.)

Line 207: Set<LogicalVariable> currentPK = new HashSet<>();
> I'm confused about currentPK and previousPK concept. Can you explain more?

In general, each round of potential substitution scans all of its branch variables to find out where they come from. currentPK holds the primary keys of the branches of the current ~=. previousPK holds the primary keys of the branches of the ~= conditions that have already been scanned/substituted. If they are equal, we treat the current condition as a duplicated fuzzy join condition on the same table pair, e.g.

use dataverse fuzzytest;
for $d in dataset DBLP
for $t in dataset CSX
where word-tokens($d.title) ~= word-tokens($t.title)
  and word-tokens($d.authors) ~= word-tokens($t.authors)
return {"did": $d.tid, "tid": $t.tid}

$d.title and $t.title, as well as their PKs, are the previous derivations, and $d.authors and $t.authors are the current derivations.

Line 210: // If PKs derived from the both branches are SAME as a previous fuzzyjoin, we treat this ~= as a select.
> Here, "previous fuzzy join" means? Can you present an example?

See the comment on line 207: the previous fuzzy join there is word-tokens($d.title) ~= word-tokens($t.title).

Line 251: ConstantExpression constExpr = (ConstantExpression) inputExp2;
> The reason of this change - not using FuzzyUtils.getSimThreshold()?

At least one case is involved: to get the threshold from similarity-jaccard() <> threshold, I think FuzzyUtils.getSimThreshold is not enough.

Line 268: break;
> Have we fixed the bug that mentioned in the previous TODO? Can we explain m

If we only permute the three for clauses in the mentioned test case, the results in this code branch are consistent. Also, if we change the join conditions in this query, I think it is not a bug but a semantic problem. I guess the old issue described in the comment on the left (red) side came from the flatten process or from an ordering issue. Anyway, it has disappeared in this branch on the current master.

Line 317: translator.addVariableToMetaScope(new VarIdentifier("$$LEFT_0"), leftInputVar);
> What's the difference between # and $$? I think I saw this in the Vernica's

Also, this issue comes from the new master of about one year ago.
In short, "#" is for operators and "$$" is for vars. In addition, the translator is triggered several times, once for each legal ~= (one whose currentPK is not the same as any of the previous PK sets). We need to increment the vars counter in each round after we generate new vars for the substituted branches' vars. Together with line 356 below, this lets us generate unique vars across all rounds of var generation requests.

Line 329: // Step3.3. the suffix 0-3 is used for identifying the different level of variable references.
> Can you present an example? different levels?

Nothing special; it is just for the anchor "#LEFT_1" in line 90 of the AQL template, to generate the vars for this anchor.

Line 356: counter.set(counter.get() + incrementedCounter);
> How is this counter used?

Refer to the comment on line 317.

Line 407: // of expRef, we need to add the full condition expRef\getItemExprRef into the top-level operator of the plan.
> Can you present an example here?

use dataverse fuzzytest;
for $d in dataset DBLP
for $t in dataset CSX
for $r in dataset ACM
where word-tokens($d.title) ~= word-tokens($t.title) and $d.year < $t.year
  and word-tokens($t.authors) ~= word-tokens($r.authors) and $t.year < $r.year
return {"did": $d.tid, "tid": $t.tid, "rid": $r.tid}

Here, $t.year < $r.year will be pushed onto the new topJoinOp of the second fuzzy join. In general, this method extracts the extra conditions besides the fuzzy join and puts them onto the new topJoinOp of the substituted plan.

Line 426: topJoin.getCondition().setValue(andFunc);
> Why is this required for left-outer-join?

I think directly applying a Select above the loj is not equivalent to inlining the extra condition into the join condition, right?

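To make the point about line 426 concrete, here is a minimal standalone sketch. It is not AsterixDB code: the class LojDemo, the row layout {id, joinKey, year}, the nested-loop join helper, and the sample data are all hypothetical and only illustrate the general left-outer-join semantics, assuming a select drops rows whose right side is null-padded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

public class LojDemo {
    // Hypothetical row layout: {id, joinKey, year}.
    static final int ID = 0, KEY = 1, YEAR = 2;

    static final int[][] LEFT = { {1, 7, 2000}, {2, 7, 2010} };
    static final int[][] RIGHT = { {10, 7, 2005} };

    /** Nested-loop left outer join; an unmatched left row is paired with null. */
    static List<int[][]> leftOuterJoin(int[][] left, int[][] right,
                                       BiPredicate<int[], int[]> cond) {
        List<int[][]> out = new ArrayList<>();
        for (int[] l : left) {
            boolean matched = false;
            for (int[] r : right) {
                if (cond.test(l, r)) {
                    out.add(new int[][] { l, r });
                    matched = true;
                }
            }
            if (!matched) {
                out.add(new int[][] { l, null });
            }
        }
        return out;
    }

    /** Renders each joined pair as "leftId-rightId" or "leftId-null". */
    static List<String> format(List<int[][]> rows) {
        List<String> out = new ArrayList<>();
        for (int[][] p : rows) {
            out.add(p[0][ID] + "-" + (p[1] == null ? "null" : String.valueOf(p[1][ID])));
        }
        return out;
    }

    /** Extra condition (left.year < right.year) inlined into the join condition. */
    public static List<String> conditionInsideJoin() {
        return format(leftOuterJoin(LEFT, RIGHT,
                (l, r) -> l[KEY] == r[KEY] && l[YEAR] < r[YEAR]));
    }

    /** Same extra condition applied as a select above the join. */
    public static List<String> selectAboveJoin() {
        List<int[][]> joined = leftOuterJoin(LEFT, RIGHT, (l, r) -> l[KEY] == r[KEY]);
        List<int[][]> selected = new ArrayList<>();
        for (int[][] p : joined) {
            // A row fails the select if the right side is null or the comparison is false.
            if (p[1] != null && p[0][YEAR] < p[1][YEAR]) {
                selected.add(p);
            }
        }
        return format(selected);
    }

    public static void main(String[] args) {
        System.out.println("inside join : " + conditionInsideJoin()); // [1-10, 2-null]
        System.out.println("select above: " + selectAboveJoin());     // [1-10]
    }
}
```

With the condition inside the join, left row 2 survives with a null right side; with the select above the join, it is lost, so the two plans are not interchangeable for a left outer join.
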
-- 
To view, visit https://asterix-gerrit.ics.uci.edu/1076
To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8736f104905eeda763d39709e002c2b9629278cc
Gerrit-PatchSet: 21
Gerrit-Project: asterixdb
Gerrit-Branch: master
Gerrit-Owner: Wenhai Li
Gerrit-Reviewer: Chen Li
Gerrit-Reviewer: Jenkins
Gerrit-Reviewer: Taewoo Kim
Gerrit-Reviewer: Wenhai Li
Gerrit-HasComments: Yes
