Hi fellow devs, I'm trying to merge in Mehnaz's improvements for DISTINCT estimation. I was able to squash merge her changes on top of master without too much issue I think. However there are a few areas where the diff is not obvious, or related code was modified simultaneously. I figured I would ask here about these two things, because at least to me, a lot of this code is not familiar.
The WIP patch is here: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/21097 I will start with the simplest one, which I think is just a consequence of refactoring: In JoinNode::addMultiDatasetPlans, we have to trigger the estimation. In the original patch, this is done like so: JoinNode jnL = this.joinEnum.jnArray[this.cheapestPlanNode.jnIndexes[0]]; JoinNode jnR = this.joinEnum.jnArray[this.cheapestPlanNode.jnIndexes[1]]; computeJoinEstDistinctCardinality(jnL, jnR); However a lot of the code around this was refactored to use inheritance and setters instead of directly accessing the field. I refactored it to this after reading some of the related code: if (cheapestPlanNode instanceof JoinPlanNode) { JoinPlanNode cheapestJoinNode = (JoinPlanNode) cheapestPlanNode; computeJoinEstDistinctCardinality(cheapestJoinNode.getLeftPlan().getJoinNode(), cheapestJoinNode.getRightPlan().getJoinNode()); } Which seems to at least work. However using instanceof is typically a sign of larger problems, so I want to be sure that this is the right way. The second issue is less straightforward and requires some particular knowledge. Currently, in Stats::findEstDistinctWithPredicates, we already do some estimation of DISTINCT cardinalities within GROUP BY clauses, from lines 1138 to 1187. It seems like it hinges around whether or not there is only 1 clause or multiple. The difficulty is, none of this new estimation code existed at the time Mehnaz's patch was being written. Therefore I am unsure if we should always use the new distinct estimator, sometimes fall back to the method in the code right now, or use some other synthesis of the two rules. I would appreciate anyone's thoughts on either matter. This is a really neat piece of work and I don't want it to get lost by simply not being merged when it works and the code is there. - Ian
