Re: DISTINCT estimation questions

Mike Carey Wed, 15 Apr 2026 12:18:15 -0700

Cool! So to be clear, you are proposing to (only) use Mehnaz's approach- i.e., to get the per-group counts and use those - fully. +1 to theproposal if so!


On 4/15/26 11:07 AM, Preetham Poluparthi wrote:

Hi,
I’ve reviewed the patch and the changes look good.
I introduced the instanceof refactoring, and I plan to replace it with 
enum-based checks in upcoming patches.
The current group-by estimator relies on the number of groups in the group-by 
clause. Mehnaz’s approach, however, uses the counts within each group, which 
provides a more accurate estimate. That said, there is a possibility of 
incorrect estimation if there is skew in the sampled data.
I’d prefer moving forward with it fully to avoid the complexity of hybrid logic.


- Preetham

On 2026/04/09 17:13:44 Ian Maxon wrote:

Hi fellow devs,
I'm trying to merge in Mehnaz's improvements for DISTINCT estimation.
I was able to squash merge her changes on top of master without too
much issue I think. However there are a few areas where the diff is
not obvious, or related code was modified simultaneously. I figured I
would ask here about these two things, because at least to me, a lot
of this code is not familiar.

The WIP patch is here:https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/21097
I will start with the simplest one, which I think is just a
consequence of refactoring:
In JoinNode::addMultiDatasetPlans, we have to trigger the estimation.
In the original patch, this is done like so:

JoinNode jnL = this.joinEnum.jnArray[this.cheapestPlanNode.jnIndexes[0]];
JoinNode jnR = this.joinEnum.jnArray[this.cheapestPlanNode.jnIndexes[1]];
computeJoinEstDistinctCardinality(jnL, jnR);

However a lot of the code around this was refactored to use
inheritance and setters instead of directly accessing the field. I
refactored it to this after reading some of the related code:

if (cheapestPlanNode instanceof JoinPlanNode) {
JoinPlanNode cheapestJoinNode = (JoinPlanNode) cheapestPlanNode;
computeJoinEstDistinctCardinality(cheapestJoinNode.getLeftPlan().getJoinNode(),
cheapestJoinNode.getRightPlan().getJoinNode());
}
Which seems to at least work. However using instanceof is typically a
sign of larger problems, so I want to be sure that this is the right
way.


The second issue is less straightforward and requires some particular
knowledge. Currently, in Stats::findEstDistinctWithPredicates, we
already do some estimation of DISTINCT cardinalities within GROUP BY
clauses, from lines 1138 to 1187. It seems like it hinges around
whether or not there is only 1 clause or multiple. The difficulty is,
none of this new estimation code existed at the time Mehnaz's patch
was being written. Therefore I am unsure if we should always use the
new distinct estimator, sometimes fall back to the method in the code
right now, or use some other synthesis of the two rules.

I would appreciate anyone's thoughts on either matter. This is a
really neat piece of work and I don't want it to get lost by simply
not being merged when it works and the code is there.

- Ian

Re: DISTINCT estimation questions

Reply via email to