Re: DISTINCT estimation questions

Preetham Poluparthi Wed, 15 Apr 2026 12:24:35 -0700

yes, only Mehnaz's approach without naive estimation or any hybrid logic. This 
is what current patch is also doing.


On 2026/04/15 19:17:28 Mike Carey wrote:
> Cool!  So to be clear, you are proposing to (only) use Mehnaz's approach 
> - i.e., to get the per-group counts and use those - fully.  +1 to the 
> proposal if so!
> 
> On 4/15/26 11:07 AM, Preetham Poluparthi wrote:
> > Hi,
> > I’ve reviewed the patch and the changes look good.
> > I introduced the instanceof refactoring, and I plan to replace it with 
> > enum-based checks in upcoming patches.
> > The current group-by estimator relies on the number of groups in the 
> > group-by clause. Mehnaz’s approach, however, uses the counts within each 
> > group, which provides a more accurate estimate. That said, there is a 
> > possibility of incorrect estimation if there is skew in the sampled data.
> > I’d prefer moving forward with it fully to avoid the complexity of hybrid 
> > logic.
> >
> > - Preetham
> >
> > On 2026/04/09 17:13:44 Ian Maxon wrote:
> >> Hi fellow devs,
> >> I'm trying to merge in Mehnaz's improvements for DISTINCT estimation.
> >> I was able to squash merge her changes on top of master without too
> >> much issue I think. However there are a few areas where the diff is
> >> not obvious, or related code was modified simultaneously. I figured I
> >> would ask here about these two things, because at least to me, a lot
> >> of this code is not familiar.
> >>
> >> The WIP patch is 
> >> here:https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/21097
> >> I will start with the simplest one, which I think is just a
> >> consequence of refactoring:
> >> In JoinNode::addMultiDatasetPlans, we have to trigger the estimation.
> >> In the original patch, this is done like so:
> >>
> >> JoinNode jnL = this.joinEnum.jnArray[this.cheapestPlanNode.jnIndexes[0]];
> >> JoinNode jnR = this.joinEnum.jnArray[this.cheapestPlanNode.jnIndexes[1]];
> >> computeJoinEstDistinctCardinality(jnL, jnR);
> >>
> >> However a lot of the code around this was refactored to use
> >> inheritance and setters instead of directly accessing the field. I
> >> refactored it to this after reading some of the related code:
> >>
> >> if (cheapestPlanNode instanceof JoinPlanNode) {
> >> JoinPlanNode cheapestJoinNode = (JoinPlanNode) cheapestPlanNode;
> >> computeJoinEstDistinctCardinality(cheapestJoinNode.getLeftPlan().getJoinNode(),
> >> cheapestJoinNode.getRightPlan().getJoinNode());
> >> }
> >> Which seems to at least work. However using instanceof is typically a
> >> sign of larger problems, so I want to be sure that this is the right
> >> way.
> >>
> >>
> >> The second issue is less straightforward and requires some particular
> >> knowledge. Currently, in Stats::findEstDistinctWithPredicates, we
> >> already do some estimation of DISTINCT cardinalities within GROUP BY
> >> clauses, from lines 1138 to 1187. It seems like it hinges around
> >> whether or not there is only 1 clause or multiple. The difficulty is,
> >> none of this new estimation code existed at the time Mehnaz's patch
> >> was being written. Therefore I am unsure if we should always use the
> >> new distinct estimator, sometimes fall back to the method in the code
> >> right now, or use some other synthesis of the two rules.
> >>
> >> I would appreciate anyone's thoughts on either matter. This is a
> >> really neat piece of work and I don't want it to get lost by simply
> >> not being merged when it works and the code is there.
> >>
> >> - Ian
> >>

Re: DISTINCT estimation questions

Reply via email to