zabetak commented on code in PR #2966:
URL: https://github.com/apache/hive/pull/2966#discussion_r857510400
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java:
##########
@@ -145,19 +138,25 @@ private ImmutableList<RexNode> getValidPreds(RelNode
child, Set<String> predicat
}
}
- // We need to filter i) those that have been pushed already as stored in
the join,
- // ii) those that were already in the subtree rooted at child,
- // iii) predicates that are not safe for transitive inference.
+ // We need to filter:
+ // i) those that have been pushed already as stored in the join,
+ // ii) those that were already in the subtree rooted at child.
+ List<RexNode> toPush =
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude, child, valids);
+
+ // If we run the rule in conservative mode, we also filter:
+ // iii) predicates that are not safe for transitive inference.
//
// There is no formal definition of safety for predicate inference, only
an empirical one.
// An unsafe predicate in this context is one that when pushed across join
operands, can lead
// to redundant predicates that cannot be simplified (by means of
predicates merging with other existing ones).
// This situation can lead to an OOM for cases where lack of
simplification allows inferring new predicates
- // (from LHS to RHS) recursively, predicates which are redundant, but that
RexSimplify cannot handle.
+ // (from LHS to RHS and vice-versa) recursively, predicates which are
redundant, but that RexSimplify cannot handle.
// This notion can be relaxed as soon as RexSimplify gets more powerful,
and it can handle such cases.
- List<RexNode> toPush =
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude, child,
valids).stream()
- .filter(unsafeOperatorsFinder::isSafe)
- .collect(Collectors.toList());
+ if (HiveConf.getBoolVar(conf,
HiveConf.ConfVars.HIVE_JOIN_PUSH_TRANSITIVE_PREDICATES_CONSERVATIVE)) {
Review Comment:
Good idea to add a configuration flag for this change.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java:
##########
@@ -143,28 +138,82 @@ private ImmutableList<RexNode>
getValidPreds(RelOptCluster cluster, RelNode chil
}
}
- // We need to filter i) those that have been pushed already as stored in
the join,
- // and ii) those that were already in the subtree rooted at child
- ImmutableList<RexNode> toPush =
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude,
- child, valids);
- return toPush;
+ // We need to filter:
+ // i) those that have been pushed already as stored in the join,
+ // ii) those that were already in the subtree rooted at child.
+ List<RexNode> toPush =
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude, child, valids);
+
+ // If we run the rule in conservative mode, we also filter:
+ // iii) predicates that are not safe for transitive inference.
+ //
+ // There is no formal definition of safety for predicate inference, only
an empirical one.
+ // An unsafe predicate in this context is one that when pushed across join
operands, can lead
+ // to redundant predicates that cannot be simplified (by means of
predicates merging with other existing ones).
+ // This situation can lead to an OOM for cases where lack of
simplification allows inferring new predicates
+ // (from LHS to RHS and vice-versa) recursively, predicates which are
redundant, but that RexSimplify cannot handle.
+ // This notion can be relaxed as soon as RexSimplify gets more powerful,
and it can handle such cases.
+ if (HiveConf.getBoolVar(conf,
HiveConf.ConfVars.HIVE_JOIN_PUSH_TRANSITIVE_PREDICATES_CONSERVATIVE)) {
+ toPush = toPush.stream()
+ .filter(unsafeOperatorsFinder::isSafe)
+ .collect(Collectors.toList());
+ }
+
+ return ImmutableList.copyOf(toPush);
}
- private RexNode getTypeSafePred(RelOptCluster cluster, RexNode rex,
RelDataType rType) {
- RexNode typeSafeRex = rex;
- if ((typeSafeRex instanceof RexCall) &&
HiveCalciteUtil.isComparisonOp((RexCall) typeSafeRex)) {
- RexBuilder rb = cluster.getRexBuilder();
- List<RexNode> fixedPredElems = new ArrayList<RexNode>();
- RelDataType commonType = cluster.getTypeFactory().leastRestrictive(
- RexUtil.types(((RexCall) rex).getOperands()));
- for (RexNode rn : ((RexCall) rex).getOperands()) {
- fixedPredElems.add(rb.ensureType(commonType, rn, true));
- }
+ //~ Inner Classes ----------------------------------------------------------
+
+ /**
+ * Finds unsafe operators in an expression (at any level of nesting).
+ * At the moment, the only unsafe operator is OR.
+ *
+ * Example 1: OR(=($0, $1), IS NOT NULL($2))):INTEGER (OR in the top-level
expression)
+ * Example 2: NOT(AND(=($0, $1), IS NOT NULL($2))
+ * this is equivalent to OR((<>($0, $1), IS NULL($2))
+ * Example 3: AND(OR(=($0, $1), IS NOT NULL($2)))) (OR in inner expression)
+ */
+ private static class UnsafeOperatorsFinder extends RexVisitorImpl<Void> {
Review Comment:
If I understand well this class tries to find disjunctions inside an
expression. Instead of using the broader term "unsafe" I would focus and rename
usages/documentation to be more explicit. This would make the code easier to
understand and plan changes due to this self-explained.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java:
##########
@@ -65,21 +66,15 @@
*/
public class HiveJoinPushTransitivePredicatesRule extends RelOptRule {
- public static final HiveJoinPushTransitivePredicatesRule INSTANCE_JOIN =
- new HiveJoinPushTransitivePredicatesRule(HiveJoin.class,
HiveRelFactories.HIVE_FILTER_FACTORY);
-
- public static final HiveJoinPushTransitivePredicatesRule INSTANCE_SEMIJOIN =
- new HiveJoinPushTransitivePredicatesRule(HiveSemiJoin.class,
HiveRelFactories.HIVE_FILTER_FACTORY);
-
- public static final HiveJoinPushTransitivePredicatesRule INSTANCE_ANTIJOIN =
- new HiveJoinPushTransitivePredicatesRule(HiveAntiJoin.class,
HiveRelFactories.HIVE_FILTER_FACTORY);
-
+ private final HiveConf conf;
Review Comment:
Instead of keeping the whole config it may be better to just keep the
boolean variable that we need. Reduces the likelihood of a memory leak due to
configuration objects.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java:
##########
@@ -144,27 +146,70 @@ public void onMatch(RelOptRuleCall call) {
}
// We need to filter i) those that have been pushed already as stored in
the join,
- // and ii) those that were already in the subtree rooted at child
- ImmutableList<RexNode> toPush =
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude,
- child, valids);
- return toPush;
+ // ii) those that were already in the subtree rooted at child,
+ // iii) predicates that are not safe for transitive inference.
+ //
+ // There is no formal definition of safety for predicate inference, only
an empirical one.
Review Comment:
By additional dependency I meant the coupling with `RexSimplify`; the rule
does (or doesn't do) something based on what the simplifier can handle. Anyways
my comment was not a blocker, but rather a remark to try to address this in a
more general fashion in a follow-up. For the moment, I am marking this thread
as resolved.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java:
##########
@@ -143,28 +138,82 @@ private ImmutableList<RexNode>
getValidPreds(RelOptCluster cluster, RelNode chil
}
}
- // We need to filter i) those that have been pushed already as stored in
the join,
- // and ii) those that were already in the subtree rooted at child
- ImmutableList<RexNode> toPush =
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude,
- child, valids);
- return toPush;
+ // We need to filter:
+ // i) those that have been pushed already as stored in the join,
+ // ii) those that were already in the subtree rooted at child.
+ List<RexNode> toPush =
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude, child, valids);
+
+ // If we run the rule in conservative mode, we also filter:
+ // iii) predicates that are not safe for transitive inference.
+ //
+ // There is no formal definition of safety for predicate inference, only
an empirical one.
+ // An unsafe predicate in this context is one that when pushed across join
operands, can lead
+ // to redundant predicates that cannot be simplified (by means of
predicates merging with other existing ones).
+ // This situation can lead to an OOM for cases where lack of
simplification allows inferring new predicates
+ // (from LHS to RHS and vice-versa) recursively, predicates which are
redundant, but that RexSimplify cannot handle.
+ // This notion can be relaxed as soon as RexSimplify gets more powerful,
and it can handle such cases.
+ if (HiveConf.getBoolVar(conf,
HiveConf.ConfVars.HIVE_JOIN_PUSH_TRANSITIVE_PREDICATES_CONSERVATIVE)) {
+ toPush = toPush.stream()
+ .filter(unsafeOperatorsFinder::isSafe)
+ .collect(Collectors.toList());
+ }
+
+ return ImmutableList.copyOf(toPush);
}
- private RexNode getTypeSafePred(RelOptCluster cluster, RexNode rex,
RelDataType rType) {
- RexNode typeSafeRex = rex;
- if ((typeSafeRex instanceof RexCall) &&
HiveCalciteUtil.isComparisonOp((RexCall) typeSafeRex)) {
- RexBuilder rb = cluster.getRexBuilder();
- List<RexNode> fixedPredElems = new ArrayList<RexNode>();
- RelDataType commonType = cluster.getTypeFactory().leastRestrictive(
- RexUtil.types(((RexCall) rex).getOperands()));
- for (RexNode rn : ((RexCall) rex).getOperands()) {
- fixedPredElems.add(rb.ensureType(commonType, rn, true));
- }
+ //~ Inner Classes ----------------------------------------------------------
+
+ /**
+ * Finds unsafe operators in an expression (at any level of nesting).
+ * At the moment, the only unsafe operator is OR.
+ *
+ * Example 1: OR(=($0, $1), IS NOT NULL($2))):INTEGER (OR in the top-level
expression)
+ * Example 2: NOT(AND(=($0, $1), IS NOT NULL($2))
+ * this is equivalent to OR((<>($0, $1), IS NULL($2))
+ * Example 3: AND(OR(=($0, $1), IS NOT NULL($2)))) (OR in inner expression)
+ */
+ private static class UnsafeOperatorsFinder extends RexVisitorImpl<Void> {
Review Comment:
If in the future we find out that we want to cover more use-cases we can
revise the naming.
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java:
##########
@@ -143,28 +138,82 @@ private ImmutableList<RexNode>
getValidPreds(RelOptCluster cluster, RelNode chil
}
}
- // We need to filter i) those that have been pushed already as stored in
the join,
- // and ii) those that were already in the subtree rooted at child
- ImmutableList<RexNode> toPush =
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude,
- child, valids);
- return toPush;
+ // We need to filter:
+ // i) those that have been pushed already as stored in the join,
+ // ii) those that were already in the subtree rooted at child.
+ List<RexNode> toPush =
HiveCalciteUtil.getPredsNotPushedAlready(predicatesToExclude, child, valids);
+
+ // If we run the rule in conservative mode, we also filter:
+ // iii) predicates that are not safe for transitive inference.
+ //
+ // There is no formal definition of safety for predicate inference, only
an empirical one.
+ // An unsafe predicate in this context is one that when pushed across join
operands, can lead
+ // to redundant predicates that cannot be simplified (by means of
predicates merging with other existing ones).
+ // This situation can lead to an OOM for cases where lack of
simplification allows inferring new predicates
+ // (from LHS to RHS and vice-versa) recursively, predicates which are
redundant, but that RexSimplify cannot handle.
+ // This notion can be relaxed as soon as RexSimplify gets more powerful,
and it can handle such cases.
+ if (HiveConf.getBoolVar(conf,
HiveConf.ConfVars.HIVE_JOIN_PUSH_TRANSITIVE_PREDICATES_CONSERVATIVE)) {
+ toPush = toPush.stream()
+ .filter(unsafeOperatorsFinder::isSafe)
+ .collect(Collectors.toList());
+ }
+
+ return ImmutableList.copyOf(toPush);
}
- private RexNode getTypeSafePred(RelOptCluster cluster, RexNode rex,
RelDataType rType) {
- RexNode typeSafeRex = rex;
- if ((typeSafeRex instanceof RexCall) &&
HiveCalciteUtil.isComparisonOp((RexCall) typeSafeRex)) {
- RexBuilder rb = cluster.getRexBuilder();
- List<RexNode> fixedPredElems = new ArrayList<RexNode>();
- RelDataType commonType = cluster.getTypeFactory().leastRestrictive(
- RexUtil.types(((RexCall) rex).getOperands()));
- for (RexNode rn : ((RexCall) rex).getOperands()) {
- fixedPredElems.add(rb.ensureType(commonType, rn, true));
- }
+ //~ Inner Classes ----------------------------------------------------------
+
+ /**
+ * Finds unsafe operators in an expression (at any level of nesting).
+ * At the moment, the only unsafe operator is OR.
+ *
+ * Example 1: OR(=($0, $1), IS NOT NULL($2))):INTEGER (OR in the top-level
expression)
+ * Example 2: NOT(AND(=($0, $1), IS NOT NULL($2))
+ * this is equivalent to OR((<>($0, $1), IS NULL($2))
+ * Example 3: AND(OR(=($0, $1), IS NOT NULL($2)))) (OR in inner expression)
+ */
+ private static class UnsafeOperatorsFinder extends RexVisitorImpl<Void> {
+ // accounting for DeMorgan's law
+ boolean inNegation = false;
+ boolean isSafe = true;
+
+ protected UnsafeOperatorsFinder() {
+ super(true);
+ }
- typeSafeRex = rb.makeCall(((RexCall) typeSafeRex).getOperator(),
fixedPredElems);
+ @Override
+ public Void visitCall(RexCall call) {
+ switch (call.getKind()) {
+ case OR:
+ if (inNegation) {
+ return super.visitCall(call);
+ } else {
+ this.isSafe = false;
+ return null;
+ }
+ case AND:
+ if (inNegation) {
+ this.isSafe = false;
+ return null;
+ } else {
+ return super.visitCall(call);
+ }
+ case NOT:
+ inNegation = !inNegation;
+ return super.visitCall(call);
+ default:
+ return super.visitCall(call);
+ }
}
- return typeSafeRex;
+ boolean isSafe(RexNode node) {
Review Comment:
Consider making this method a utility by moving it in `HiveCalciteUtil`
e.g., `HiveCalciteUtil#containsDisjunction`. Also maybe it would be safer to
make the class disposable, i.e., instantiate only inside the method and then
discard. We are going to spend some extra resources for instantiation but I
guess the overhead is negligible.
##########
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:
##########
@@ -2547,6 +2547,9 @@ public static enum ConfVars {
"If this config is true only pushed down filters remain in the
operator tree, \n" +
"and the original filter is removed. If this config is false, the
original filter \n" +
"is also left in the operator tree at the original place."),
+
HIVE_JOIN_PUSH_TRANSITIVE_PREDICATES_CONSERVATIVE("hive.optimize.join.transitive.predicates.conservative",
+ false, "Whether to avoid pushing predicates that are hard to simplify.
\n"
Review Comment:
I am not sure if we want to keep this so general or focus on what actually
happens, which is forbidding disjunction pushdown (e.g.,
`hive.optimize.join.disjunctive.transitive.predicates.pushdown`).
Regarding the default value maybe it would be better to disallow the
optimization by default. From my perspective, correctness is more important
than performance so having HS2 crashing with OOM is more serious than having
perf regression in some queries.
Both of the comments above are rather personal preferences so I am not
pushing hard for making these changes. The final decision is up to you :)
##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveJoinPushTransitivePredicatesRule.java:
##########
@@ -65,21 +66,15 @@
*/
public class HiveJoinPushTransitivePredicatesRule extends RelOptRule {
- public static final HiveJoinPushTransitivePredicatesRule INSTANCE_JOIN =
- new HiveJoinPushTransitivePredicatesRule(HiveJoin.class,
HiveRelFactories.HIVE_FILTER_FACTORY);
-
- public static final HiveJoinPushTransitivePredicatesRule INSTANCE_SEMIJOIN =
- new HiveJoinPushTransitivePredicatesRule(HiveSemiJoin.class,
HiveRelFactories.HIVE_FILTER_FACTORY);
-
- public static final HiveJoinPushTransitivePredicatesRule INSTANCE_ANTIJOIN =
- new HiveJoinPushTransitivePredicatesRule(HiveAntiJoin.class,
HiveRelFactories.HIVE_FILTER_FACTORY);
-
+ private final HiveConf conf;
private final FilterFactory filterFactory;
+ private final UnsafeOperatorsFinder unsafeOperatorsFinder = new
UnsafeOperatorsFinder();
public HiveJoinPushTransitivePredicatesRule(Class<? extends Join> clazz,
- FilterFactory filterFactory) {
+ FilterFactory filterFactory, HiveConf conf) {
Review Comment:
I think the rule is only meant to work with `HiveFilter` so I guess it is
safe to remove this parameterization from the constructor. I wouldn't consider
CBO rules as public API that should be consumed by clients.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]