konstantinb commented on PR #6418:
URL: https://github.com/apache/hive/pull/6418#issuecomment-4671800984

   @zabetak, my apologies for not following up on all your questions/comments 
of this PR; I was focused on its alternative solution.
   
   Taking your suggestion from #discussion_r3233225674 — making the const case 
`if (cs.isConst()) ndv = 1` — broke 8 .out files this PR never touched: 
`union7/15/17/19`, `unionDistinct_3`, `tez_union_multiinsert`, 
`explainanalyze_2`, `load_dyn_part14`.
   
   The cause is exactly what you flagged on #6359 when you first saw the field:
   
   > Can we avoid this new indicator? If we add a new field it means that we 
need to keep them up to date in various places where `ColStatistics` are used 
[...]
   
   `isConst` is set at the producers and maintained in the CASE/COALESCE 
combiner (`PessimisticStatCombiner`), but **not** in the UNION combiner 
(`Statistics.addToColumnStats`) — that one recomputes `countDistinct` and 
`numNulls` and leaves `isConst` alone. So a column out of a `UNION` of distinct 
constants still reports `isConst = true`, and the unconditional `ndv = 1` acts 
on a stale flag. There's no single place that keeps the flag consistent across 
every path `ColStatistics` takes — your March objection, made concrete. 
(Maintaining it in `addToColumnStats` too doesn't rescue this: 
`result.isConst() && stat.isConst()` — the rule already in the CASE combiner — 
still classifies a UNION of *distinct* constants as constant, so there's no 
boolean that's correct here without comparing the underlying values, which 
`ColStatistics` doesn't carry.)
   
   That's also why I'd rather take the rest of these together than one at a 
time — they're facets of the same `NDV == 0` overload. The boolean-branch ones 
are locally patchable (and `:832` you rightly flagged as likely unreachable), 
but a local patch just relocates the ambiguity instead of removing it. The one 
that can't be patched around is Part B: distinguishing a genuinely-unknown NDV 
from a verified zero — your funnel_step / device_type case — needs the two 
states to be *different values*, which a boolean const flag can't supply for a 
non-constant column. The sentinel separates them once and the rest follow.
   
   What does fix them is the other direction from that same #6359 thread:
   
   > everything would be simpler if we could use -1 for NDV to declare unknown 
as it happens for the other stats. This is probably a bigger change to digest 
so let's not go into this direction for now.
   
   That's HIVE-29625 / #6505. It's the larger change you flagged as a lot to 
digest — ~17 files, more `.q.out` churn — which is why I tried the flag first. 
But the consistency problem is structural, not a missing case, so I don't think 
the `isConst` path can be made correct. With the -1 treated as "unknown NDV". 
there's no flag to sync: `extractNDVGroupingColumns` becomes just `if (ndv < 0) 
fall back to the heuristic; else +1 for the null bucket`, and your Part B 
underestimation resolves on its own — unknown columns are `-1` and fall back, 
genuinely-all-null columns stay `0` and get the `+1`, so both "sources of 
(NDV=0, numNulls>0)" land right without guessing their frequency.
   
   So I'd like to close this PR (and HIVE-29556) as superseded by HIVE-29625 
and move review to #6505 — the inevitable version of this rather than one more 
attempt at the flag, and the direction you pointed to first. Apologies for the 
detour to get here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to