zabetak commented on code in PR #5037:
URL: https://github.com/apache/hive/pull/5037#discussion_r1475779718
##########
ql/src/java/org/apache/hadoop/hive/ql/plan/mapper/PlanMapper.java:
##########
@@ -200,54 +199,46 @@ public void merge(Object o1, Object o2) {
}
private void link(Object o1, Object o2, boolean mayMerge) {
-
- Set<Object> keySet = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
- keySet.add(o1);
- keySet.add(o2);
- keySet.add(getKeyFor(o1));
- keySet.add(getKeyFor(o2));
-
- Set<EquivGroup> mGroups = Collections.newSetFromMap(new IdentityHashMap<EquivGroup, Boolean>());
-
- for (Object object : keySet) {
- EquivGroup group = objectMap.get(object);
- if (group != null) {
- mGroups.add(group);
- }
+ // Caches signatures on the first access. A signature of an Operator could change as optimizers could mutate it,
+ // keeping its semantics. The current implementation caches the signature before optimizations so that we can
+ // link Operators with their signatures at consistent timing.
+ registerSignature(o1);
+ registerSignature(o2);
+ final EquivGroup linkedGroup1 = objectMap.get(o1);
+ final EquivGroup linkedGroup2 = objectMap.get(o2);
+
+ if (linkedGroup1 == null && linkedGroup2 == null) {
+ final EquivGroup group = new EquivGroup();
+ group.add(o1);
+ group.add(o2);
+ groups.add(group);
+ return;
}
- if (mGroups.size() > 1) {
- if (!mayMerge) {
- throw new RuntimeException("equivalence mapping violation");
- }
- EquivGroup newGrp = new EquivGroup();
- newGrp.add(o1);
- newGrp.add(o2);
- for (EquivGroup g : mGroups) {
- for (Object o : g.members) {
- newGrp.add(o);
- }
- }
- groups.add(newGrp);
- groups.removeAll(mGroups);
- } else {
- EquivGroup targetGroup = mGroups.isEmpty() ? new EquivGroup() : mGroups.iterator().next();
- groups.add(targetGroup);
- targetGroup.add(o1);
- targetGroup.add(o2);
+ if (linkedGroup1 == null || linkedGroup2 == null || linkedGroup1 == linkedGroup2) {
+ final EquivGroup group = linkedGroup1 != null ? linkedGroup1 : linkedGroup2;
+ group.add(o1);
+ group.add(o2);
+ return;
}
+ if (!mayMerge) {
+ throw new RuntimeException(String.format(
+ "Failed to link %s and %s. This error mostly means a bug of Hive",
Review Comment:
nit: An uncaught `RuntimeException` almost always denotes a bug in the application, so I don't think we need to make the exception message that verbose by stating the obvious.
Adding the operators that led to the exception to the message is a good idea and makes sense.
##########
ql/src/test/queries/clientpositive/cbo_cte_materialization.q:
##########
@@ -0,0 +1,27 @@
+--! qt:dataset:src
+
+set hive.optimize.cte.materialize.threshold=1;
+set hive.optimize.cte.materialize.full.aggregate.only=false;
+
+EXPLAIN CBO
+WITH materialized_cte AS (
+ SELECT key, value FROM src WHERE key != '100'
+),
+another_materialized_cte AS (
+ SELECT key, value FROM src WHERE key != '100'
+)
+SELECT a.key, a.value, b.key, b.value
+FROM materialized_cte a
+JOIN another_materialized_cte b ON a.key = b.key
+ORDER BY a.key;
+
+WITH materialized_cte AS (
+ SELECT key, value FROM src WHERE key != '100'
+),
+another_materialized_cte AS (
+ SELECT key, value FROM src WHERE key != '100'
+)
+SELECT a.key, a.value, b.key, b.value
+FROM materialized_cte a
+JOIN another_materialized_cte b ON a.key = b.key
+ORDER BY a.key;
Review Comment:
The problem is in the compilation phase, so I don't think we need to actually run the query. Consider dropping the execution. If you opt to keep it, I would suggest crafting a much simpler test (without 1K lines in the output); it would be nice to keep the execution time of our test suite as low as possible. Moreover, it is not easy to see what the `src` table contains, so verifying that the result is indeed correct is cumbersome.
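A much simpler .q test along the lines suggested above might look like the following sketch. This is only an illustration of the idea, not part of the PR: the table name `tiny` and its two rows are invented so the expected output stays small and easy to verify by eye.

```sql
--! Hypothetical minimal variant: a two-row inline table instead of the src dataset.
set hive.optimize.cte.materialize.threshold=1;
set hive.optimize.cte.materialize.full.aggregate.only=false;

create table tiny (key string, value string);
insert into tiny values ('1', 'one'), ('2', 'two');

EXPLAIN CBO
WITH materialized_cte AS (
  SELECT key, value FROM tiny WHERE key != '1'
),
another_materialized_cte AS (
  SELECT key, value FROM tiny WHERE key != '1'
)
SELECT a.key, a.value, b.key, b.value
FROM materialized_cte a
JOIN another_materialized_cte b ON a.key = b.key;
```

With only two rows in play, the .q.out stays a few lines long and the correctness of the result is obvious at a glance.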
##########
ql/src/java/org/apache/hadoop/hive/ql/plan/mapper/PlanMapper.java:
##########
@@ -200,54 +199,46 @@ public void merge(Object o1, Object o2) {
}
private void link(Object o1, Object o2, boolean mayMerge) {
-
- Set<Object> keySet = Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());
- keySet.add(o1);
- keySet.add(o2);
- keySet.add(getKeyFor(o1));
- keySet.add(getKeyFor(o2));
-
- Set<EquivGroup> mGroups = Collections.newSetFromMap(new IdentityHashMap<EquivGroup, Boolean>());
-
- for (Object object : keySet) {
- EquivGroup group = objectMap.get(object);
- if (group != null) {
- mGroups.add(group);
- }
+ // Caches signatures on the first access. A signature of an Operator could change as optimizers could mutate it,
+ // keeping its semantics. The current implementation caches the signature before optimizations so that we can
+ // link Operators with their signatures at consistent timing.
+ registerSignature(o1);
+ registerSignature(o2);
Review Comment:
Linking operators implicitly (using signatures) was added by @kgyrtkirk in HIVE-18926. This change reverts that logic and takes signatures out of the equation for determining equivalence groups.
My understanding is that if this change goes in, operators can be linked with signatures **only** explicitly. I don't see test failures or .q.out changes due to this, so it seems that the change does not have a big impact on existing use cases. Moreover, it fixes some known compilation crashes, so I consider it an improvement over the current situation.
I don't see a significant risk in merging this change, but I will let @kgyrtkirk comment in case I am missing something.
Assuming that we do not allow implicit linking via signatures, I don't see the point of registering the signature at this stage. My understanding is that the cache is meant to improve performance, not to serve as a means of performing equivalence checks. If we don't make use of the signature here, then I don't think we should compute it. @okumin, although you added some comments justifying the registration, I would appreciate some additional clarification regarding these additions.
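For readers following along, the case analysis in the new `link` implementation amounts to a small union-find-style grouping structure. The sketch below is a hedged, self-contained illustration of that logic, not the actual Hive `PlanMapper`: the class name `LinkSketch` and the simplified `EquivGroup` are hypothetical stand-ins, and the signature-registration step is deliberately omitted since its purpose is the open question above.

```java
import java.util.*;

// Hypothetical sketch of the three cases in the new link(o1, o2, mayMerge) logic.
public class LinkSketch {
  static class EquivGroup {
    // Identity semantics, mirroring the IdentityHashMap usage in the PR.
    final Set<Object> members = Collections.newSetFromMap(new IdentityHashMap<>());
    void add(Object o) { members.add(o); }
  }

  final List<EquivGroup> groups = new ArrayList<>();
  final Map<Object, EquivGroup> objectMap = new IdentityHashMap<>();

  void link(Object o1, Object o2, boolean mayMerge) {
    final EquivGroup g1 = objectMap.get(o1);
    final EquivGroup g2 = objectMap.get(o2);
    if (g1 == null && g2 == null) {
      // Case 1: neither object has been seen before; open a fresh group.
      final EquivGroup g = new EquivGroup();
      addTo(g, o1);
      addTo(g, o2);
      groups.add(g);
      return;
    }
    if (g1 == null || g2 == null || g1 == g2) {
      // Case 2: at most one existing group is involved; extend it.
      final EquivGroup g = (g1 != null) ? g1 : g2;
      addTo(g, o1);
      addTo(g, o2);
      return;
    }
    // Case 3: two distinct groups; folding them together is only legal when merging.
    if (!mayMerge) {
      throw new RuntimeException("equivalence mapping violation");
    }
    for (Object o : new ArrayList<>(g2.members)) {
      addTo(g1, o);
    }
    groups.remove(g2);
  }

  private void addTo(EquivGroup g, Object o) {
    g.add(o);
    objectMap.put(o, g);
  }

  public static void main(String[] args) {
    LinkSketch pm = new LinkSketch();
    Object a = new Object(), b = new Object(), c = new Object(), d = new Object();
    pm.link(a, b, false); // case 1: new group {a, b}
    pm.link(c, d, false); // case 1: new group {c, d}
    pm.link(a, c, true);  // case 3: merges both into one group
    System.out.println(pm.groups.size());                // prints 1
    System.out.println(pm.groups.get(0).members.size()); // prints 4
  }
}
```

The same third call with `mayMerge == false` would throw, which is exactly the "equivalence mapping violation" path the first review comment discusses.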
##########
ql/src/test/queries/clientpositive/cbo_cte_materialization.q:
##########
@@ -0,0 +1,27 @@
+--! qt:dataset:src
+
+set hive.optimize.cte.materialize.threshold=1;
+set hive.optimize.cte.materialize.full.aggregate.only=false;
+
+EXPLAIN CBO
+WITH materialized_cte AS (
+ SELECT key, value FROM src WHERE key != '100'
+),
+another_materialized_cte AS (
+ SELECT key, value FROM src WHERE key != '100'
+)
+SELECT a.key, a.value, b.key, b.value
+FROM materialized_cte a
+JOIN another_materialized_cte b ON a.key = b.key
+ORDER BY a.key;
Review Comment:
Consider adding (or replacing this with) a traditional `EXPLAIN`, where we can see the effect of `hive.optimize.cte.materialize.threshold` and the full plan for the materialized CTEs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]