scovich opened a new issue, #9451: URL: https://github.com/apache/arrow-rs/issues/9451
**Describe the bug** The parquet [Statistics::null_count_opt](https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html#method.null_count_opt) method returns Some(0) when the underlying stats are missing. This makes the nullcount stats we report actively dangerous -- not merely useless -- when attempting to prune row groups based on parquet stats: * Consider the predicate `x IS NULL` * Any footer with nullcount stat `Some(0)` for x (no nulls) can be safely skipped, because no row can satisfy the predicate * Any footer with nullcount stat `None` for x (no nullcount stat) _must not_ be skipped, because any number of rows could satisfy the predicate Because we force `None` to `Some(0)` when [decoding null count stats](https://docs.rs/parquet/latest/src/parquet/file/statistics.rs.html#122-131), it is impossible for a reader to safely use the resulting stat -- even if the writer correctly distinguished `None` vs. `Some(0)`! **To Reproduce** Read null count stats (or, just read the source code linked above). **Expected behavior** The reader should return whatever the writer produced. If a sub-optimal writer produced None when it could have produced Some(0), that just means the stat is unavailable -- the whole point of an _opt method. Callers who really want to conflate the two are welcome to `unwrap_or(0)`. At least then the intent is stated clearly. **Additional context** The current behavior is actually documented... but gives no justification for the behavior nor any hint of warning about the ramifications: > Note this API returns Some(0) even if the null count was not present in the statistics. See https://github.com/apache/arrow-rs/pull/6216/files Worse, [ValueStatistics::null_count_opt](https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html#method.null_count_opt) does _not_ include the same note, which tricked at least one library user into thinking that the problem is somehow local to the `Statistics::null_count_opt` wrapper method rather than built into the stats decoding itself: https://github.com/delta-io/delta-kernel-rs/pull/362/changes#diff-1ca4255de9858e955b9c478a0358de966b3add53ec1ec3e448b26533d0d4e97bR187-R199 Meanwhile, on the only PR [comment](https://github.com/apache/arrow-rs/pull/6216#discussion_r1716983500) that mentions `Some(0)` seems to deal with the _write_ path: > The previous API treated null_count = 0 as None. > The new API treats null_count = 0 as Some(0). ... which apparently is a problem because the parquet writer still encodes `Some(0)` as `None`? * https://github.com/apache/arrow-rs/issues/6256 ... but the issue description there makes a false claim : > This will not cause issues for people using parquet-rs to read and write data as the reader also (incorrectly) reports `Some(0)` when the thrift metadata has `None` That incorrect reader behavior _IS_, in fact, a problem: * In the days before `null_count_opt`, the value `0` could mean "no nulls" but it could also mean "no nullcount stat". * So readers could not safely perform stats-based row group pruning for an IS [NOT] NULL predicate (because "no nulls" and "no nullcount stat" have opposite meanings). * Now that `null_count_opt` exists, we _should_ be able to distinguish the two scenarios as `Some(0)` vs. `None`. But the reader undoes that distinction with an `unwrap_or(0)` in `from_thrift_page_stats` * Presumably that was a misguided attempt to preserve... something... but in practice it just means null count stats continue to be dangerous (not merely useless). Note that writers and readers have different safety rules here: * It's correct (albeit pessimal) for a writer to not produce a stat. It just means readers won't have as much information to work with later. * It's dangerously incorrect for a reader to infer Some(0) when encountering None. See example above. Do we know of any writer that emits `Some(0)` where it should have produced `None`? All the commentary I could find (see additional context below) suggests the opposite problem -- that writers emit `None` when they could/should have emitted `Some(0)`. If that's true, then we should still honor the None vs. Some(0) distinction on the read path when it exists, and let writers catch up when they catch up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
