linliu-code opened a new pull request, #18792:
URL: https://github.com/apache/hudi/pull/18792

   ### Change Logs
   
   This PR fixes two related correctness bugs in the column-stats metadata 
table that cause silent zero-row query results when data-skipping is enabled:
   
   - **#18754** — A single `NaN` value in a `FLOAT`/`DOUBLE` column corrupts 
the persisted col-stats record for that file. parquet-mr clears the underlying 
parquet stats (`hasNonNullValue=false`) but leaves min/max at their default 
`0.0`. Hudi was reading those sentinel values and persisting `min=0.0, max=0.0` 
to the MDT. The 1.x read path then used those fabricated bounds, pruning the 
file at the partition-stats level and dropping rows from any predicate 
involving that partition — including predicates on completely unrelated columns 
of files that contain no NaN at all.
   
   - **#18755** — When a Parquet column omits stats entirely (e.g. one value 
exceeds parquet-mr's stats truncation threshold ~1KB for binary/string), the 
col-stats writer was mis-recording `nullCount = valueCount`, i.e. "all rows in 
this file are null". Any non-null predicate then silently skips that file. The 
existing comment at the bug site intended to handle "column contains only null 
values" but conflated it with the "stats absent" case.
   
   Both bugs cause silent data loss at query time — no error, no warning, just 
an incomplete result set.
   
   ### Root cause
   
   `hudi-hadoop-common/.../ParquetUtils.readColumnStatsFromMetadata` did two 
things that turned out to be unsafe in the presence of unreliable parquet stats:
   
   1. It called `stats.genericGetMin()/getMax()` without checking 
`stats.hasNonNullValue()`. parquet-mr uses that flag specifically to signal 
"min/max are unreliable, do not trust them" — most commonly hit when any NaN 
appears in a `DOUBLE`/`FLOAT` column.
   2. It did `stats.isEmpty() ? columnChunkMetaData.getValueCount() : 
stats.getNumNulls()`. `isEmpty()` is `true` both when the column is genuinely 
all-null and when parquet omitted stats entirely. The two cases need opposite 
handling.
   
   The error then propagated downstream: even after `ParquetUtils` was fixed to 
report null min/max for unreliable stats, the per-partition aggregator 
`HoodieColumnRangeMetadata.merge` silently dropped the null contribution from 
the unreliable file. This left `PartitionStatsIndexSupport.prunePartitions` 
with a partial bound that pruned the partition based on the *other* files' 
stats — which is the silent zero-row symptom in the #18754 reproducer.
   
   The read-side filter generator in `DataSkippingUtils` also needed updating: 
a predicate like `colA_maxValue > X` evaluates to null when the file has null 
max stat, and Spark treats null as false in `WHERE`, so the file gets pruned by 
the data-skipping index filter. Wrapping with `OR colA_minValue IS NULL` makes 
it scanned instead.
   
   ### Changes
   
   - `ParquetUtils.readColumnStatsFromMetadata` — gate min/max on 
`hasNonNullValue()`; gate `nullCount = valueCount` on `isNumNullsSet()`.
   - `HoodieTableMetadataUtil.collectColumnRangeMetadata` — skip NaN values in 
the streaming min/max accumulator (mirrors parquet-mr's `DoubleStatistics` 
policy). `Double.compare(NaN, x)` returns `+1` in Java, so an unguarded 
comparison would let a single NaN poison the running max forever.
   - `HoodieTableMetadataUtil.mergeColumnStatsRecords` — distinguish "all-null 
column" from "stats unreliable" when merging records during MDT compaction.
   - `HoodieColumnRangeMetadata.merge` — the per-file → per-partition merge 
helper used by `PartitionStatsIndexSupport`. Apply the same unreliability 
propagation here so partition-level aggregates reflect the worst-case unknown 
when any contributing file has unreliable stats. **This was the missing piece** 
that kept the read-side bug visible even after the other writer fixes were in 
place.
   - `DataSkippingUtils.scala` — wrap every min/max-using skip predicate (`<`, 
`>`, `<=`, `>=`, EqualTo via `genColumnValuesEqualToExpression`, In, InSet, 
StartsWith, the various `Not(...)` cases) with `OR colMin IS NULL` so a file 
with stats marked unreliable is always scanned rather than pruned by 
null-propagation.
   - `SparkBaseIndexSupport.scala` — in `getCandidateFiles`, augment the 
assembled index filter with a global "any column referenced in the query has 
null min stat" clause. Belt-and-suspenders against any predicate case the 
per-predicate wrap might miss.
   
   ### Impact / Risk
   
   The fixes are strictly more conservative than the prior behavior: files with 
unreliable stats are now scanned instead of silently pruned. Tables that don't 
contain NaN values and don't have stats-truncated columns are unaffected. 
Performance impact for those tables is zero (the new gates short-circuit when 
stats are reliable).
   
   For tables that do hit either condition, queries previously returning 
silently-wrong (incomplete) results now return correct results; pruning is less 
aggressive on the file(s) that triggered the unreliable-stats signal.
   
   ### Impact
   
   Describe any public API changes.
   
   No public API change. Internal write/read code only.
   
   ### Risk level
   
   Medium. The merge-helper change (`HoodieColumnRangeMetadata.merge`) is in a 
path shared by partition-stats aggregation and any other caller that merges 
per-file ranges; the semantics shift from "drop null" to "propagate null when 
stats are unreliable, drop null when column is genuinely all-null". The 
distinguishing check is `nullCount == valueCount`, which is the existing 
convention for the all-null sentinel — so no API change, only behavior 
alignment with the documented contract.
   
   ### Documentation Update
   
   No documentation changes required. The bugs were silent regressions against 
the existing contract that "data-skipping should never change query results".
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to