zclllyybb commented on issue #64122:
URL: https://github.com/apache/doris/issues/64122#issuecomment-4622731256
I checked the current code path on upstream master and the 3.0 / 4.0 / 4.1
branches. This report looks like a real sample-statistics validation bug, but
the fix should avoid persisting a contradictory `ndv=0` + non-null min/max
statistic as-is.
What is confirmed from the code:
- `OlapAnalysisTask.doSample()` first calls `collectMinMax()`, whose
`BASIC_STATS_TEMPLATE` query scans the table/index without the sample tablet
hint, so min/max can come from the full table.
- The sampled stats SQL then uses `${rowCount}` from the OLAP table/index
row-count metadata, while `null_count` is computed from sampled rows and
scaled, for example `ROUND(SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) *
${scaleFactor})`.
- `BaseAnalysisTask.runQuery()` builds `ColStatsData` and calls
`ColStatsData.isValid()` unconditionally, without knowing whether the row came
from full analyze or sample analyze.
- `ColStatsData.isValid()` then rejects any row matching `ndv == 0 &&
min/max is not all null && nullCount != count`. That check encodes a full-stat
invariant, which does not hold for this mixed-provenance sampled row.
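To make the rejection condition concrete, here is a minimal, self-contained
model of the check as described in the bullets above. It is not the Doris
source: the class `ColStatsRowSketch`, its fields, and the example values are
hypothetical and only mirror the quoted condition.

```java
// Hypothetical model of the check described above; NOT the actual Doris code.
final class ColStatsRowSketch {
    final long count;      // row count (table/index metadata in the sample path)
    final long ndv;        // number of distinct values
    final long nullCount;  // scaled from the sample in the sample path
    final String min;      // may come from a full-table scan
    final String max;

    ColStatsRowSketch(long count, long ndv, long nullCount, String min, String max) {
        this.count = count;
        this.ndv = ndv;
        this.nullCount = nullCount;
        this.min = min;
        this.max = max;
    }

    /** Returns false when the row matches the quoted rejection condition. */
    boolean isValid() {
        boolean hasMinOrMax = min != null || max != null; // "min/max is not all null"
        return !(ndv == 0 && hasMinOrMax && nullCount != count);
    }

    public static void main(String[] args) {
        // All-NULL column from a single full scan: consistent, accepted.
        ColStatsRowSketch full = new ColStatsRowSketch(1000, 0, 1000, null, null);
        // Sampled row: full-table min/max present, sampled NDV is 0,
        // scaled null count differs from the metadata count: rejected.
        ColStatsRowSketch sampled = new ColStatsRowSketch(1000, 0, 990, "7", "7");
        System.out.println(full.isValid());    // true
        System.out.println(sampled.isValid()); // false
    }
}
```

The second row is the shape this issue describes: every field is individually
plausible, but the fields were not produced by the same scan.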
So the failing shape is plausible: a sampled OLAP analyze can have
full-table min/max proving at least one non-NULL value exists, while the
sampled NDV path can still produce `0`, and the scaled `null_count` is not an
exact value that must equal the metadata `count`. Rejecting the whole collected
statistic at this point matches the warning and the fallback to unknown/default
column stats.
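A small arithmetic sketch of why the scaled `null_count` need not equal the
metadata `count`. All numbers are made up for illustration, and how
`${scaleFactor}` is actually derived is an assumption here: the sketch simply
takes a factor that was estimated for a slightly different number of rows than
the sample really returned.

```java
// Hypothetical numbers only; the scale-factor derivation is an assumption.
final class ScaledNullCountSketch {
    /** Mirrors ROUND(SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) * scaleFactor). */
    static long scaledNullCount(long sampledNullRows, double scaleFactor) {
        return Math.round(sampledNullRows * scaleFactor);
    }

    public static void main(String[] args) {
        long metadataRowCount = 1_000_000L; // stands in for ${rowCount}
        double scaleFactor = 250.0;         // assumed: estimated for 4,000 sample rows
        long sampledNullRows = 3_990L;      // assumed: sample returned 3,990 rows, all NULL

        long nullCount = scaledNullCount(sampledNullRows, scaleFactor);
        System.out.println(nullCount);                     // 997500
        System.out.println(nullCount != metadataRowCount); // true
    }
}
```

With every sampled row NULL, the sampled NDV is `0`, yet the scaled null count
still misses the metadata count, so the `nullCount != count` term fires even
though nothing is wrong beyond sampling error.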
Suggested fix direction:
- Split validation by statistic provenance: keep the strict invariant for
full analyze rows where count/null-count/ndv/min/max come from the same scan,
but do not use `nullCount != count` as a hard rejection condition for sampled
rows.
- For sampled rows where full-table min/max is non-null but sampled NDV is
`0`, normalize the representation before caching/persisting it, for example by
clamping NDV to at least `1` or by clearing min/max for that sampled-zero-NDV
case. A blind bypass is risky because `StatsCalculator.checkNdvValidation()`
also treats `ndv == 0 && min/max != null` as invalid and can disable join
reorder later.
- Add a regression test for an OLAP sample analyze where rare non-NULL
values are missed or under-estimated by the sample while full-table min/max is
non-null.
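The first two suggestions could be sketched roughly as follows. This is an
illustration under assumptions, not a proposed patch: `Provenance`,
`normalizeNdv`, and this `isValid` signature are hypothetical names that do not
exist in Doris, and the sketch picks the clamp-NDV-to-`1` option rather than
clearing min/max.

```java
// Hypothetical sketch of provenance-aware validation; names are illustrative.
final class ProvenanceAwareStatsSketch {
    enum Provenance { FULL, SAMPLE }

    /** Clamp sampled NDV to 1 when min/max proves a non-NULL value exists. */
    static long normalizeNdv(Provenance p, long ndv, String min, String max) {
        boolean hasMinOrMax = min != null || max != null;
        if (p == Provenance.SAMPLE && ndv == 0 && hasMinOrMax) {
            return 1;
        }
        return ndv;
    }

    static boolean isValid(Provenance p, long count, long ndv, long nullCount,
                           String min, String max) {
        boolean hasMinOrMax = min != null || max != null;
        if (p == Provenance.FULL) {
            // Strict invariant: all fields come from the same scan.
            return !(ndv == 0 && hasMinOrMax && nullCount != count);
        }
        // Sampled rows: nullCount is an estimate, so it is not compared to count.
        // A non-normalized ndv == 0 with min/max is still refused, because
        // downstream NDV validation treats that shape as invalid.
        return !(ndv == 0 && hasMinOrMax);
    }

    public static void main(String[] args) {
        long ndv = normalizeNdv(Provenance.SAMPLE, 0, "7", "7");
        System.out.println(ndv);                                                   // 1
        System.out.println(isValid(Provenance.SAMPLE, 1000, ndv, 990, "7", "7")); // true
        System.out.println(isValid(Provenance.FULL, 1000, 0, 990, "7", "7"));     // false
    }
}
```

Normalizing before validation keeps the persisted row free of the `ndv == 0`
plus non-null min/max combination, which is the shape that
`StatsCalculator.checkNdvValidation()` is said above to reject.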
Information that would still help validate the final patch:
- Exact Doris build/commit for the reported 3.x/4.x deployments.
- The `ANALYZE` statement or auto-analyze settings, especially sample
rows/percent and whether tablet sampling or `LIMIT` was used.
- A small reproducible table layout: partition/tablet count, approximate row
counts, and where the rare non-NULL values are located.
- FE logs around the analyze job, plus the generated analyze SQL if debug
logging is available.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]