Hi,
I’ve reviewed the patch and the changes look good.
I introduced the instanceof refactoring, and I plan to replace it with 
enum-based checks in upcoming patches.
The current group-by estimator relies on the number of groups in the group-by 
clause. Mehnaz’s approach, however, uses the counts within each group, which 
provides a more accurate estimate. That said, there is a possibility of 
incorrect estimation if there is skew in the sampled data.
I’d prefer moving forward with it fully to avoid the complexity of hybrid logic.

- Preetham

On 2026/04/09 17:13:44 Ian Maxon wrote:
> Hi fellow devs,
> I'm trying to merge in Mehnaz's improvements for DISTINCT estimation.
> I was able to squash merge her changes on top of master without too
> much issue I think. However there are a few areas where the diff is
> not obvious, or related code was modified simultaneously. I figured I
> would ask here about these two things, because at least to me, a lot
> of this code is not familiar.
> 
> The WIP patch is here: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/21097
> I will start with the simplest one, which I think is just a
> consequence of refactoring:
> In JoinNode::addMultiDatasetPlans, we have to trigger the estimation.
> In the original patch, this is done like so:
> 
> JoinNode jnL = this.joinEnum.jnArray[this.cheapestPlanNode.jnIndexes[0]];
> JoinNode jnR = this.joinEnum.jnArray[this.cheapestPlanNode.jnIndexes[1]];
> computeJoinEstDistinctCardinality(jnL, jnR);
> 
> However a lot of the code around this was refactored to use
> inheritance and setters instead of directly accessing the field. I
> refactored it to this after reading some of the related code:
> 
> if (cheapestPlanNode instanceof JoinPlanNode) {
> JoinPlanNode cheapestJoinNode = (JoinPlanNode) cheapestPlanNode;
> computeJoinEstDistinctCardinality(cheapestJoinNode.getLeftPlan().getJoinNode(),
> cheapestJoinNode.getRightPlan().getJoinNode());
> }
> Which seems to at least work. However using instanceof is typically a
> sign of larger problems, so I want to be sure that this is the right
> way.
> 
> 
> The second issue is less straightforward and requires some particular
> knowledge. Currently, in Stats::findEstDistinctWithPredicates, we
> already do some estimation of DISTINCT cardinalities within GROUP BY
> clauses, from lines 1138 to 1187. It seems like it hinges around
> whether or not there is only 1 clause or multiple. The difficulty is,
> none of this new estimation code existed at the time Mehnaz's patch
> was being written. Therefore I am unsure if we should always use the
> new distinct estimator, sometimes fall back to the method in the code
> right now, or use some other synthesis of the two rules.
> 
> I would appreciate anyone's thoughts on either matter. This is a
> really neat piece of work and I don't want it to get lost by simply
> not being merged when it works and the code is there.
> 
> - Ian
> 

Reply via email to