Change in asterixdb[master]: Applied the multiway fuzzyjoin based on the prefix-based joi...

Wenhai Li (Code Review) Mon, 07 Nov 2016 08:25:16 -0800

Wenhai Li has posted comments on this change.

Change subject: Applied the multiway fuzzyjoin based on the prefix-based join 
and the selectFuzzyJoin testCases.
......................................................................



Patch Set 21:

(10 comments)

@Taewoo,

Sorry, I did not know you can't see the comments without publishing. :)

OK, published. Maybe we can talk in more detail with a running example. You 
know, for those specific details, these pessimistic methods can always prevent 
exceptions during the dynamic variable generation.

https://asterix-gerrit.ics.uci.edu/#/c/1076/21/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java
File 
asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java:

Line 197:         // To handle multiple fuzzyjoin conditions on the same table 
pair, this rule differentiate the PKs
> What do we mean by [the same table pair] here?
Query example:
use dataverse fuzzytest;
for $d in dataset DBLP
for $t in dataset CSX
where word-tokens($d.title) ~= word-tokens($t.title) and 
word-tokens($d.authors) ~= word-tokens($t.authors)
return {"did": $d.tid, "tid": $t.tid}

Explain in general:

The first round has the following functional dependencies:
1. $d.title -> $d.tid
2. $t.title -> $t.tid
which means $d.title is derived from table $d and $t.title is derived from $t, 
respectively.

After iteration, in the second round, we have the following functional 
dependencies:
1. $d.authors -> $d.tid
2. $t.authors -> $t.tid
Both right-hand sides are already maintained in previousPK.

Result:
In this context, we have two fuzzy join conditions on the same table pair, 
and the second fuzzy join SHOULD be interpreted as a fuzzy select over the 
result of the first fuzzy join.

Handling strategy:
Just skip the second fuzzy join rather than treating it as another fuzzy join, 
to avoid a wrong substitution based on the fixed template. (We have already 
substituted the two table branches in the first fuzzy join.)


Line 207:         Set<LogicalVariable> currentPK = new HashSet<>();
> I'm confused about currentPK and previousPK concept. Can you explain more?
In general, each round of potential substitution scans all of its branch 
variables to find out where they come from.

currentPK is the set of all primary keys of the current ~='s branches.

previousPK is the set of all primary keys of the scanned/substituted ~='s 
branches.

If they are equal, we treat it as a duplicate fuzzy join condition on the same 
table pair.

e.g.

use dataverse fuzzytest; for $d in dataset DBLP for $t in dataset CSX where 
word-tokens($d.title) ~= word-tokens($t.title) and word-tokens($d.authors) ~= 
word-tokens($t.authors) return {"did": $d.tid, "tid": $t.tid}

$d.title and $t.title, as well as their PKs, are the previous derivations, and 
$d.authors and $t.authors are the current derivations.
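
The comparison described above could be sketched roughly as follows. This is 
only an illustration under assumptions: the class FuzzyJoinPkTracker and the 
method registerAsJoin are hypothetical names, plain strings stand in for 
LogicalVariable, and the real rule in FuzzyJoinRule.java may organize this 
differently.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the code in FuzzyJoinRule.java.
// Strings stand in for LogicalVariable.
class FuzzyJoinPkTracker {
    // PK sets of the ~= conditions already rewritten as fuzzy joins.
    private final List<Set<String>> previousPKs = new ArrayList<>();

    // Returns true if this ~= should be rewritten as a fuzzy join, or false
    // if its PK set equals an earlier one, i.e. it repeats a table pair and
    // should be treated as a select.
    boolean registerAsJoin(Set<String> currentPK) {
        for (Set<String> previousPK : previousPKs) {
            if (previousPK.equals(currentPK)) {
                return false;
            }
        }
        previousPKs.add(new HashSet<>(currentPK));
        return true;
    }
}
```

With the query above, the title condition would register the PK set 
{$d.tid, $t.tid} as a join, and the authors condition would then find the same 
set and be handled as a select.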


Line 210:         // If PKs derived from the both branches are SAME as a 
previous fuzzyjoin, we treat this ~= as a select.
> Here, "previous fuzzy join" means? Can you present an example?
Refer to the comment on line 207: the previous fuzzy join there is 
word-tokens($d.title) ~= word-tokens($t.title).


Line 251:         ConstantExpression constExpr = (ConstantExpression) inputExp2;
> The reason of this change - not using FuzzyUtils.getSimThreshold()?
At least one case is involved:

similarity-jaccard() <> threshold

To get the threshold in this case, I think FuzzyUtils.getSimThreshold is not 
enough.


Line 268:                 break;
> Have we fixed the bug that mentioned in the previous TODO? Can we explain m
If we only permute the three for clauses in the mentioned test case, the 
results in this code branch are consistent. Also, if we change the join 
conditions in this query, I think it is not a bug but a semantic difference. I 
guess the old issue, noted in the comment on the left (red) side of the diff, 
came from the flattening process or an ordering issue. Anyway, it has 
disappeared in this branch on the current master.


Line 317:         translator.addVariableToMetaScope(new 
VarIdentifier("$$LEFT_0"), leftInputVar);
> What's the difference between # and $$? I think I saw this in the Vernica's
Also, this issue comes from the new master of about one year ago. In short, 
"#" is for operators and "$$" is for vars. In addition, the translator will be 
triggered several times, once for each legal ~= (one whose currentPK is not the 
same as any of the previous PK sets). You know, we need to increment the vars 
counter in each round after we generate new vars for the substituted branches' 
vars. Together with line 356 below, we can thus generate distinct vars across 
all rounds of var generation requests.
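
A rough sketch of the counter handling across rounds, under assumptions: 
VarCounterSketch and newRound are hypothetical names, the "$$" + number naming 
is only illustrative, and the plain int field stands in for the shared counter 
that line 356 updates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the translator code itself.
class VarCounterSketch {
    // Stand-in for the shared variable counter.
    private int counter;

    VarCounterSketch(int start) {
        counter = start;
    }

    // Generates howMany new variable names for one ~= round, then advances
    // the shared counter so the next round cannot reuse the same ids.
    List<String> newRound(int howMany) {
        List<String> vars = new ArrayList<>();
        for (int i = 0; i < howMany; i++) {
            vars.add("$$" + (counter + i));
        }
        // Mirrors counter.set(counter.get() + incrementedCounter).
        counter += howMany;
        return vars;
    }
}
```

Without the final increment, a second round would start from the same counter 
value and produce names that clash with the first round.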


Line 329:         // Step3.3. the suffix 0-3 is used for identifying the 
different level of variable references.
> Can you present an example? different levels?
Nothing special, it is just for the anchor "#LEFT_1" in line 90 of the AQL 
template, to generate the vars for this anchor.


Line 356:         counter.set(counter.get() + incrementedCounter);
> How is this counter used?
Refer to the comment on line 317.


Line 407:     // of expRef, we need to add the full condition 
expRef\getItemExprRef into the top-level operator of the plan.
> Can you present an example here?
use dataverse fuzzytest;

for $d in dataset DBLP 
for $t in dataset CSX 
for $r in dataset ACM
where word-tokens($d.title) ~= word-tokens($t.title) 
and $d.year < $t.year
and word-tokens($t.authors) ~= word-tokens($r.authors)
and $t.year < $r.year
return {"did": $d.tid, "tid": $t.tid, "rid": $r.tid}

Here, $t.year < $r.year will be pushed onto the new topJoinOp of the second 
fuzzy join.

In general, this method extracts the extra conditions besides the fuzzy join 
and places them onto the new topJoinOp of the substituted plan.


Line 426:         topJoin.getCondition().setValue(andFunc);
> Why is this required for left-outer-join?
I think directly applying a Select above the loj is not equivalent to inlining 
the extra condition into the join condition, right?
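
A small, self-contained illustration of that difference, under assumptions: 
LeftOuterJoinSketch and its methods are hypothetical, integers stand in for 
records, and a null right side stands in for the missing match of an outer 
join. It is not the plan code in the rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

// Hypothetical sketch comparing the two placements of an extra condition.
class LeftOuterJoinSketch {
    // Left outer join: an unmatched left value is kept and paired with null.
    static List<Integer[]> leftOuterJoin(List<Integer> left, List<Integer> right,
            BiPredicate<Integer, Integer> condition) {
        List<Integer[]> out = new ArrayList<>();
        for (Integer l : left) {
            boolean matched = false;
            for (Integer r : right) {
                if (condition.test(l, r)) {
                    out.add(new Integer[] { l, r });
                    matched = true;
                }
            }
            if (!matched) {
                out.add(new Integer[] { l, null });
            }
        }
        return out;
    }

    // Select applied above the join: rows with a null right side cannot
    // satisfy the predicate, so they are dropped.
    static List<Integer[]> selectAbove(List<Integer[]> rows,
            BiPredicate<Integer, Integer> extra) {
        List<Integer[]> out = new ArrayList<>();
        for (Integer[] row : rows) {
            if (row[1] != null && extra.test(row[0], row[1])) {
                out.add(row);
            }
        }
        return out;
    }
}
```

For left = [1, 5], right = [3] and the extra condition l < r, inlining the 
condition into the join keeps both left values (5 is paired with null), while 
joining first and selecting above keeps only the pair (1, 3).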


-- 
To view, visit https://asterix-gerrit.ics.uci.edu/1076
To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8736f104905eeda763d39709e002c2b9629278cc
Gerrit-PatchSet: 21
Gerrit-Project: asterixdb
Gerrit-Branch: master
Gerrit-Owner: Wenhai Li <[email protected]>
Gerrit-Reviewer: Chen Li <[email protected]>
Gerrit-Reviewer: Jenkins <[email protected]>
Gerrit-Reviewer: Taewoo Kim <[email protected]>
Gerrit-Reviewer: Wenhai Li <[email protected]>
Gerrit-HasComments: Yes
