This is an automated email from the ASF dual-hosted git repository. apitrou pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/parquet-testing.git
The following commit(s) were added to refs/heads/master by this push: new 33b4e23 Add file with NaN in statistics (#35) 33b4e23 is described below commit 33b4e23376c28e489c6a08b9207829b29e4bffb8 Author: Will Jones <willjones...@gmail.com> AuthorDate: Mon Jan 30 02:21:59 2023 -0800 Add file with NaN in statistics (#35) * add file with NaN in statistics * copy down relevant rules into readme * add more explination --- data/README.md | 58 ++++++++++++++++++++++++++++++++++++++++++++++ data/nan_in_stats.parquet | Bin 0 -> 329 bytes 2 files changed, 58 insertions(+) diff --git a/data/README.md b/data/README.md index b2c5128..072a9d5 100644 --- a/data/README.md +++ b/data/README.md @@ -40,6 +40,7 @@ | overflow_i16_page_cnt.parquet | row group with more than INT16_MAX pages | | bloom_filter.bin | deprecated bloom filter binary with binary header and murmur3 hashing | | bloom_filter.xxhash.bin | bloom filter binary with thrift header and xxhash hashing | +| nan_in_stats.parquet | statistics contains NaN in max, from PyArrow 0.8.0. See note below on "NaN in stats". | TODO: Document what each file is in the table above. @@ -117,3 +118,60 @@ https://github.com/apache/parquet-format/commit/54839ad5e04314c944fed8aa4bc6cf15 `bloom_filter.xxhash.bin` uses the newer xxHash-based bloom filter format as of https://github.com/apache/parquet-format/commit/3fb10e00c2204bf1c6cc91e094c59e84cefcee33. + +## NaN in stats + +Prior to version 1.4.0, the C++ Parquet writer would write NaN values in min and +max statistics. (Correction in [this issue](https://issues.apache.org/jira/browse/PARQUET-1225)). +It has been updated since to ignore NaN values when calculating +statistics, but for backwards compatibility the following rules were established +(in [PARQUET-1222](https://github.com/apache/parquet-format/pull/185)): + +> For backwards compatibility when reading files: +> * If the min is a NaN, it should be ignored. +> * If the max is a NaN, it should be ignored. +> * If the min is +0, the row group may contain -0 values as well. +> * If the max is -0, the row group may contain +0 values as well. +> * When looking for NaN values, min and max should be ignored. + +The file `nan_in_stats.parquet` was generated with: + +```python +import pyarrow as pa # version 0.8.0 +import pyarrow.parquet as pq +from numpy import NaN + +tab = pa.Table.from_arrays( + [pa.array([1.0, NaN])], + names="x" +) + +pq.write_table(tab, "nan_in_stats.parquet") + +metadata = pq.read_metadata("nan_in_stats.parquet") +metadata.row_group(0).column(0) +# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f28539e58f0> +# file_offset: 88 +# file_path: +# type: DOUBLE +# num_values: 2 +# path_in_schema: x +# is_stats_set: True +# statistics: +# <pyarrow._parquet.RowGroupStatistics object at 0x7f28539e5738> +# has_min_max: True +# min: 1 +# max: nan +# null_count: 0 +# distinct_count: 0 +# num_values: 2 +# physical_type: DOUBLE +# compression: 1 +# encodings: <map object at 0x7f28539eb4e0> +# has_dictionary_page: True +# dictionary_page_offset: 4 +# data_page_offset: 36 +# index_page_offset: 0 +# total_compressed_size: 84 +# total_uncompressed_size: 80 +``` diff --git a/data/nan_in_stats.parquet b/data/nan_in_stats.parquet new file mode 100755 index 0000000..28b4044 Binary files /dev/null and b/data/nan_in_stats.parquet differ