Github user zellerh commented on a diff in the pull request:

    https://github.com/apache/trafodion/pull/1530#discussion_r182912167
  
    --- Diff: core/sql/optimizer/NormRelExpr.cpp ---
    @@ -2826,18 +2827,39 @@ RelExpr* Join::transformSemiJoin(NormWA& normWARef)
     
      /* Apply the transformation described in item b) above.
        The transformation below is done if there are no non-equijoin preds 
either 
    -  and the inner side has no base tables (i.e. is an IN LIST) or if we have
    -  used a CQD to turn this transformation on for a specific user. For the 
general
    -  case we are not certain if this transformation is always beneficial, so 
it is 
    -  not on by default */
    +  and the inner side has no base tables (i.e. is an IN LIST) OR if the 
groupby
    +  is expected to provide a reduction > 
SEMIJOIN_TO_INNERJOIN_REDUCTION_RATIO
    +  (default is 5.0) OR the inner row count is small OR if we have used a 
CQD to 
    +  turn this transformation on. Some rationale: A data reduction might 
reduce
    +  the amount of data for the inner table of a hash join (or it might not!
    +  hash-semi-join sometimes does duplicate elimination itself, but not 
always).
    +  Converting to a join allows the join to be commuted; if the number of 
rows
    +  is small, nested join might be profitably chosen in that case. */
     
           ValueIdSet preds ;
           preds += joinPred();
           preds += selectionPred();
           preds -= getEquiJoinPredicates() ;
     
    +      EstLogPropSharedPtr innerEstLogProp = 
child(1)->getGroupAttr()->outputLogProp((*GLOBAL_EMPTY_INPUT_LOGPROP));
    +      CostScalar innerRowCount = innerEstLogProp->getResultCardinality(); 
    +      CostScalar innerUec = 
innerEstLogProp->getAggregateUec(equiJoinCols1);
    --- End diff --
    
    One question about this: This UEC can be very unreliable, if we don't have 
multi-column statistics. So, could it happen that we have a semi-join with a 
very large inner table, where our UEC is underestimated? The extra groupby 
could be a bad choice in that situation.
    
    On the other hand, I think we usually overestimate multi-column UECs, so 
maybe this is not so much of a concern.


---

Reply via email to