scovich opened a new issue, #9451:
URL: https://github.com/apache/arrow-rs/issues/9451

   **Describe the bug**
   
   The parquet 
[Statistics::null_count_opt](https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html#method.null_count_opt)
 method returns Some(0) when the underlying stats are missing. 
   
   This makes the nullcount stats we report actively dangerous -- not merely 
useless -- when attempting to prune row groups based on parquet stats:
   * Consider the predicate `x IS NULL`
   * Any footer with nullcount stat `Some(0)` for x (no nulls) can be safely 
skipped, because no row can satisfy the predicate
   * Any footer with nullcount stat `None` for x (no nullcount stat) _must not_ 
be skipped, because any number of rows could satisfy the predicate
   
   Because we force `None` to `Some(0)` when [decoding null count 
stats](https://docs.rs/parquet/latest/src/parquet/file/statistics.rs.html#122-131),
 it is impossible for a reader to safely use the resulting stat -- even if the 
writer correctly distinguished `None` vs. `Some(0)`!
   
   **To Reproduce**
   
   Read null count stats (or, just read the source code linked above).
   
   **Expected behavior**
   
   The reader should return whatever the writer produced. If a sub-optimal 
writer produced None when it could have produced Some(0), that just means the 
stat is unavailable -- the whole point of an _opt method. Callers who really 
want to conflate the two are welcome to `unwrap_or(0)`. At least then the 
intent is stated clearly.
   
   **Additional context**
   
   The current behavior is actually documented... but gives no justification 
for the behavior nor any hint of warning about the ramifications:
   > Note this API returns Some(0) even if the null count was not present in 
the statistics. See https://github.com/apache/arrow-rs/pull/6216/files
   
   Worse, 
[ValueStatistics::null_count_opt](https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html#method.null_count_opt)
 does _not_ include the same note, which tricked at least one library user into 
thinking that the problem is somehow local to the `Statistics::null_count_opt` 
wrapper method rather than built into the stats decoding itself:
   
https://github.com/delta-io/delta-kernel-rs/pull/362/changes#diff-1ca4255de9858e955b9c478a0358de966b3add53ec1ec3e448b26533d0d4e97bR187-R199
   
   Meanwhile, on the only PR 
[comment](https://github.com/apache/arrow-rs/pull/6216#discussion_r1716983500) 
that mentions `Some(0)` seems to deal with the _write_ path:
   > The previous API treated null_count = 0 as None.
   > The new API treats null_count = 0 as Some(0).
   
   ... which apparently is a problem because the parquet writer still encodes 
`Some(0)` as `None`? 
   * https://github.com/apache/arrow-rs/issues/6256
   
   ... but the issue description there makes a false claim :
   > This will not cause issues for people using parquet-rs to read and write 
data as the reader also (incorrectly) reports `Some(0)` when the thrift 
metadata has `None`
   
   That incorrect reader behavior _IS_, in fact, a problem: 
   * In the days before `null_count_opt`, the value `0` could mean "no nulls" 
but it could also mean "no nullcount stat". 
   * So readers could not safely perform stats-based row group pruning for an 
IS [NOT] NULL predicate (because "no nulls" and "no nullcount stat" have 
opposite meanings).
   * Now that `null_count_opt` exists, we _should_ be able to distinguish the 
two scenarios as `Some(0)` vs. `None`. But the reader undoes that distinction 
with an `unwrap_or(0)` in `from_thrift_page_stats`
   * Presumably that was a misguided attempt to preserve... something... but in 
practice it just means null count stats continue to be dangerous (not merely 
useless). 
   
   Note that writers and readers have different safety rules here:
   * It's correct (albeit pessimal) for a writer to not produce a stat. It just 
means readers won't have as much information to work with later.
   * It's dangerously incorrect for a reader to infer Some(0) when encountering 
None. See example above.
   
   Do we know of any writer that emits `Some(0)` where it should have produced 
`None`? All the commentary I could find (see additional context below) suggests 
the opposite problem -- that writers emit `None` when they could/should have 
emitted `Some(0)`. If that's true, then we should still honor the None vs. 
Some(0) distinction on the read path when it exists, and let writers catch up 
when they catch up.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to