Github user zellerh commented on a diff in the pull request:
https://github.com/apache/trafodion/pull/1530#discussion_r182912167
--- Diff: core/sql/optimizer/NormRelExpr.cpp ---
@@ -2826,18 +2827,39 @@ RelExpr* Join::transformSemiJoin(NormWA& normWARef)
/* Apply the transformation described in item b) above.
The transformation below is done if there are no non-equijoin preds
either
- and the inner side has no base tables (i.e. is an IN LIST) or if we have
- used a CQD to turn this transformation on for a specific user. For the
general
- case we are not certain if this transformation is always beneficial, so
it is
- not on by default */
+ and the inner side has no base tables (i.e. is an IN LIST) OR if the
groupby
+ is expected to provide a reduction >
SEMIJOIN_TO_INNERJOIN_REDUCTION_RATIO
+ (default is 5.0) OR the inner row count is small OR if we have used a
CQD to
+ turn this transformation on. Some rationale: A data reduction might
reduce
+ the amount of data for the inner table of a hash join (or it might not!
+ hash-semi-join sometimes does duplicate elimination itself, but not
always).
+ Converting to a join allows the join to be commuted; if the number of
rows
+ is small, nested join might be profitably chosen in that case. */
ValueIdSet preds ;
preds += joinPred();
preds += selectionPred();
preds -= getEquiJoinPredicates() ;
+ EstLogPropSharedPtr innerEstLogProp =
child(1)->getGroupAttr()->outputLogProp((*GLOBAL_EMPTY_INPUT_LOGPROP));
+ CostScalar innerRowCount = innerEstLogProp->getResultCardinality();
+ CostScalar innerUec =
innerEstLogProp->getAggregateUec(equiJoinCols1);
--- End diff --
One question about this: This UEC can be very unreliable, if we don't have
multi-column statistics. So, could it happen that we have a semi-join with a
very large inner table, where our UEC is underestimated? The extra groupby
could be a bad choice in that situation.
On the other hand, I think we usually overestimate multi-column UECs, so
maybe this is not so much of a concern.
---