Wenhai Li has posted comments on this change.

Change subject: Applied the multiway fuzzyjoin based on the prefix-based join 
and the selectFuzzyJoin testCases.
......................................................................


Patch Set 10:

(15 comments)

@Taewoo,

Sorry, I did not know you cann't see the comments without publishing. :)

OK, published. Maybe we can talk with some detail with running example. You 
know, for those specified detail, these pessimistic methods can always forbid 
the exception of the dynamic variables generation.

https://asterix-gerrit.ics.uci.edu/#/c/1076/10/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java
File 
asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/FuzzyJoinRule.java:

Line 83:     // Step1: Initial the host embedding aql to substitute the fuzzy 
euqal condition such as $r.a ~= $s.b
> What does "Initial the host embedding aql" mean?
This is grammatical mistake. Here, initial = initialize. This is a substituted 
template to ingest the leaves of the fuzzyjoin and translate them into the 
equal fuzzyjoin subplan. It means that the following template AQLPLUS will be 
embedded regardless the token functions.


Line 88:             + "((#LEFT_%d_0), " + "  (join((#RIGHT_%d_0), "
> %d change is necessary because?
I our old version, we always duplicate the vars with a 
LogicalOperatorDeepCopyVaraiblesVisitor other than the newest 
LogicalOperatorDeepCopyWithNewVaraiblesVisitor, we will get duplicated vars if 
not differentiate the vars like this. Now, I "have removed" this var in the 
version 12 of this patch.

Even by this, the following remarks are also useful for the function of this 
variable.
As will be replaced by the nBranches, this %d formatter is used to 
differentiate the VarIdentifier (line 311-338 in Step 3.1-3.3) in the 
multi-fuzzyjoin context, where additional leaf vars will be introduced into the 
translator. We need to different the leaf identifier once this rule is 
triggered multiple times in the multi-fuzzyjoin. Otherwise, we will get the 
same var identifier for the different fuzzyjoin branches. For example,
for $r in dataset DBLP
for $s in dataset CSX\
for $t in dataset ACM
where word-tokens($r.title) ~= word-tokens($s.title) and word-tokens($r.title) 
~= word-tokens($t.title)
return {"rid":$r.id, "sid":$s.id, "tid":$t.id}
, we will get a plan abstracted as (word-token($r.title) fj 
word-tokens($s.title)) => (word-token($r.title) fj word-tokens($t.title)
. The var substitution will make the three identical leaves different.


Line 201:         VariableUtilities.getLiveVariables(rightInputOp, liveVars);
> What do we want to achieve for this change (line 201 - 220)? Can you put mo
This is a segment especially for the SELECT based on fuzzy join on the same 
table pair. Let take an example for this:
for $d in dataset DBLP
for $t in dataset CSX
where word-tokens($d.title) ~= word-tokens($t.title)
and word-tokens($d.authors) ~= word-tokens($t.authors)
return {"did": $d.id, "tid": $t.id}
Actually, these two fuzzyjoin are placed on the same table pair. Of course we 
can fire two fuzzyjoins consequently. Nevertheless, a cheaper method is to 
setup a fuzzyjoin on the first condition and conduct a SELECT towards its 
result. Is well known that the fuzzy join will generally propagate the input 
cardinalities of the join  branches.

Here, pk are appended onto the static previousPK list after each round of 
fuzzyjoin are ingested. We can differentiate whether a same table-pair has been 
translated.


Line 250:         if (constVal.getObject() instanceof AFloat) {
> instanceOf can be replaced by using constVal.getObject() and BuiltinType.
Done


Line 253:             simThreshold = 
FuzzyUtils.getSimThreshold(metadataProvider);
> Do we have this case?
By default, without user input?


Line 342:         for (int i = 1; i < 4; i++) {
> Why this is fixed number? i < 4?
right side will be used as the token order source, right record order and 
surrogate join. Please refer to the template AQLPLUS.


Line 378:         // Step 5. Bind the plan to the father op referred by the 
below opRef.
> Can we put an example here? "the father op referred by the below opRef." is
Also refer an example.
Here, suppose the fuzzyjoin are word-tokens(DBLP.title) fj 
word-tokens(CSX.title) return {DBLP.id, CSX.id}. The father means the operator 
depending on fj, i.e. return {DBLP.id, CSX.id}. In general, this "parent" is 
the upstream operator that referred to fj in the original plan.


Line 408:     private void setConditionForLeftOuterJoin(LeftOuterJoinOperator 
topJoin, Mutable<ILogicalExpression> expRef) {
> In the above, can we put some comments? What does this function do?
Done


Line 441:                     Mutable<ILogicalExpression> expRefRet = 
getSimilarityExpression(arg);
> Is this change necessary?
It is up to the reviewer. :)


Line 451:     private List<LogicalVariable> 
findPrimaryKeysInSubplan(Collection<LogicalVariable> liveVars,
> In the above, can we put some comments? What does this function do?
Done


https://asterix-gerrit.ics.uci.edu/#/c/1076/10/asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/subplan/InlineSubplanInputForNestedTupleSourceRule.java
File 
asterixdb/asterix-algebra/src/main/java/org/apache/asterix/optimizer/rules/subplan/InlineSubplanInputForNestedTupleSourceRule.java:

Line 256: public class InlineSubplanInputForNestedTupleSourceRule implements 
IAlgebraicRewriteRule {
> Why the "hasRun" boolean needs to be removed? Can you explain this to me?
Specially, in the multi-fuzzyjoin cases, we have common cases that contain 
multiple nestedtuplesource along a fuzzyjoin branch. We want to flatten all 
those nestedtuplesource operator and translate them into a series of subplans. 
This is a real problem since I can only get this case in the star-fuzzyjoin. 
Maybe we can set up a talk for this.


https://asterix-gerrit.ics.uci.edu/#/c/1076/10/asterixdb/asterix-app/src/main/java/org/apache/asterix/api/common/APIFramework.java
File 
asterixdb/asterix-app/src/main/java/org/apache/asterix/api/common/APIFramework.java:

Line 313:                 QueryLogicalExpressionJobGen.INSTANCE));
> Not necessary change - can you check your formatter?
Done


https://asterix-gerrit.ics.uci.edu/#/c/1076/10/asterixdb/asterix-app/src/test/resources/optimizerts/queries/fj-dblp-csx-hybrid.aql
File 
asterixdb/asterix-app/src/test/resources/optimizerts/queries/fj-dblp-csx-hybrid.aql:

Line 18:  */
> Can you a description of this test? What it is intended for? And the expect
Done


https://asterix-gerrit.ics.uci.edu/#/c/1076/10/hyracks-fullstack/algebricks/algebricks-core/src/main/java/org/apache/hyracks/algebricks/core/algebra/operators/logical/visitors/IsomorphismUtilities.java
File 
hyracks-fullstack/algebricks/algebricks-core/src/main/java/org/apache/hyracks/algebricks/core/algebra/operators/logical/visitors/IsomorphismUtilities.java:

Line 81:     private static ILogicalOperator 
extractPKProduction(ILogicalOperator root, LogicalVariable pk)
> Please add some comments in the above - what does this function do?
Done


Line 105:     public static void mergeHomogeneousPK(ILogicalOperator op, 
List<LogicalVariable> pks) throws AlgebricksException {
> Please add some comments in the above - what does this function do?
Done


-- 
To view, visit https://asterix-gerrit.ics.uci.edu/1076
To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8736f104905eeda763d39709e002c2b9629278cc
Gerrit-PatchSet: 10
Gerrit-Project: asterixdb
Gerrit-Branch: master
Gerrit-Owner: Wenhai Li <[email protected]>
Gerrit-Reviewer: Chen Li <[email protected]>
Gerrit-Reviewer: Jenkins <[email protected]>
Gerrit-Reviewer: Taewoo Kim <[email protected]>
Gerrit-Reviewer: Wenhai Li <[email protected]>
Gerrit-HasComments: Yes

Reply via email to