yes, only Mehnaz's approach without naive estimation or any hybrid logic. This is what current patch is also doing.
On 2026/04/15 19:17:28 Mike Carey wrote: > Cool! So to be clear, you are proposing to (only) use Mehnaz's approach > - i.e., to get the per-group counts and use those - fully. +1 to the > proposal if so! > > On 4/15/26 11:07 AM, Preetham Poluparthi wrote: > > Hi, > > I’ve reviewed the patch and the changes look good. > > I introduced the instanceof refactoring, and I plan to replace it with > > enum-based checks in upcoming patches. > > The current group-by estimator relies on the number of groups in the > > group-by clause. Mehnaz’s approach, however, uses the counts within each > > group, which provides a more accurate estimate. That said, there is a > > possibility of incorrect estimation if there is skew in the sampled data. > > I’d prefer moving forward with it fully to avoid the complexity of hybrid > > logic. > > > > - Preetham > > > > On 2026/04/09 17:13:44 Ian Maxon wrote: > >> Hi fellow devs, > >> I'm trying to merge in Mehnaz's improvements for DISTINCT estimation. > >> I was able to squash merge her changes on top of master without too > >> much issue I think. However there are a few areas where the diff is > >> not obvious, or related code was modified simultaneously. I figured I > >> would ask here about these two things, because at least to me, a lot > >> of this code is not familiar. > >> > >> The WIP patch is > >> here:https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/21097 > >> I will start with the simplest one, which I think is just a > >> consequence of refactoring: > >> In JoinNode::addMultiDatasetPlans, we have to trigger the estimation. > >> In the original patch, this is done like so: > >> > >> JoinNode jnL = this.joinEnum.jnArray[this.cheapestPlanNode.jnIndexes[0]]; > >> JoinNode jnR = this.joinEnum.jnArray[this.cheapestPlanNode.jnIndexes[1]]; > >> computeJoinEstDistinctCardinality(jnL, jnR); > >> > >> However a lot of the code around this was refactored to use > >> inheritance and setters instead of directly accessing the field. I > >> refactored it to this after reading some of the related code: > >> > >> if (cheapestPlanNode instanceof JoinPlanNode) { > >> JoinPlanNode cheapestJoinNode = (JoinPlanNode) cheapestPlanNode; > >> computeJoinEstDistinctCardinality(cheapestJoinNode.getLeftPlan().getJoinNode(), > >> cheapestJoinNode.getRightPlan().getJoinNode()); > >> } > >> Which seems to at least work. However using instanceof is typically a > >> sign of larger problems, so I want to be sure that this is the right > >> way. > >> > >> > >> The second issue is less straightforward and requires some particular > >> knowledge. Currently, in Stats::findEstDistinctWithPredicates, we > >> already do some estimation of DISTINCT cardinalities within GROUP BY > >> clauses, from lines 1138 to 1187. It seems like it hinges around > >> whether or not there is only 1 clause or multiple. The difficulty is, > >> none of this new estimation code existed at the time Mehnaz's patch > >> was being written. Therefore I am unsure if we should always use the > >> new distinct estimator, sometimes fall back to the method in the code > >> right now, or use some other synthesis of the two rules. > >> > >> I would appreciate anyone's thoughts on either matter. This is a > >> really neat piece of work and I don't want it to get lost by simply > >> not being merged when it works and the code is there. > >> > >> - Ian > >>
