[parquet-testing] branch master updated: Add file with NaN in statistics (#35)

apitrou Mon, 30 Jan 2023 02:22:13 -0800

This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git



The following commit(s) were added to refs/heads/master by this push:
     new 33b4e23  Add file with NaN in statistics (#35)
33b4e23 is described below

commit 33b4e23376c28e489c6a08b9207829b29e4bffb8
Author: Will Jones <willjones...@gmail.com>
AuthorDate: Mon Jan 30 02:21:59 2023 -0800

    Add file with NaN in statistics (#35)
    
    * add file with NaN in statistics
    
    * copy down relevant rules into readme
    
    * add more explanation
---
 data/README.md            |  58 ++++++++++++++++++++++++++++++++++++++++++++++
 data/nan_in_stats.parquet | Bin 0 -> 329 bytes
 2 files changed, 58 insertions(+)

diff --git a/data/README.md b/data/README.md
index b2c5128..072a9d5 100644
--- a/data/README.md
+++ b/data/README.md
@@ -40,6 +40,7 @@
 | overflow_i16_page_cnt.parquet                  | row group with more than INT16_MAX pages                   |
 | bloom_filter.bin                               | deprecated bloom filter binary with binary header and murmur3 hashing |
 | bloom_filter.xxhash.bin                        | bloom filter binary with thrift header and xxhash hashing    |
+| nan_in_stats.parquet                           | statistics contain NaN in max, from PyArrow 0.8.0. See note below on "NaN in stats".  |
 
 TODO: Document what each file is in the table above.
 
@@ -117,3 +118,60 @@ https://github.com/apache/parquet-format/commit/54839ad5e04314c944fed8aa4bc6cf15
 
 `bloom_filter.xxhash.bin` uses the newer xxHash-based bloom filter format as of
 https://github.com/apache/parquet-format/commit/3fb10e00c2204bf1c6cc91e094c59e84cefcee33.
+
+## NaN in stats
+
+Prior to version 1.4.0, the C++ Parquet writer would write NaN values in min and
+max statistics. (This was corrected in [PARQUET-1225](https://issues.apache.org/jira/browse/PARQUET-1225).)
+The writer has since been updated to ignore NaN values when calculating
+statistics, but for backwards compatibility the following rules were established
+(in [PARQUET-1222](https://github.com/apache/parquet-format/pull/185)):
+
+> For backwards compatibility when reading files:
+> * If the min is a NaN, it should be ignored.
+> * If the max is a NaN, it should be ignored.
+> * If the min is +0, the row group may contain -0 values as well.
+> * If the max is -0, the row group may contain +0 values as well.
+> * When looking for NaN values, min and max should be ignored.
+
+The file `nan_in_stats.parquet` was generated with:
+
+```python
+import pyarrow as pa # version 0.8.0
+import pyarrow.parquet as pq
+from numpy import NaN
+
+tab = pa.Table.from_arrays(
+    [pa.array([1.0, NaN])],
+    names=["x"]
+)
+
+pq.write_table(tab, "nan_in_stats.parquet")
+
+metadata = pq.read_metadata("nan_in_stats.parquet")
+metadata.row_group(0).column(0)
+# <pyarrow._parquet.ColumnChunkMetaData object at 0x7f28539e58f0>
+#   file_offset: 88
+#   file_path: 
+#   type: DOUBLE
+#   num_values: 2
+#   path_in_schema: x
+#   is_stats_set: True
+#   statistics:
+#     <pyarrow._parquet.RowGroupStatistics object at 0x7f28539e5738>
+#       has_min_max: True
+#       min: 1
+#       max: nan
+#       null_count: 0
+#       distinct_count: 0
+#       num_values: 2
+#       physical_type: DOUBLE
+#   compression: 1
+#   encodings: <map object at 0x7f28539eb4e0>
+#   has_dictionary_page: True
+#   dictionary_page_offset: 4
+#   data_page_offset: 36
+#   index_page_offset: 0
+#   total_compressed_size: 84
+#   total_uncompressed_size: 80
+```
diff --git a/data/nan_in_stats.parquet b/data/nan_in_stats.parquet
new file mode 100755
index 0000000..28b4044
Binary files /dev/null and b/data/nan_in_stats.parquet differ
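
As an illustration only, the backwards-compatibility rules quoted in the
README hunk above could be applied on the reader side roughly as follows.
This is a minimal sketch, not code from any Parquet implementation: the
helper names `usable_bounds` and `may_contain` are hypothetical, and the
statistics are assumed to have already been decoded into Python floats
(or None when absent).

```python
import math
from typing import Optional, Tuple


def usable_bounds(
    stat_min: Optional[float], stat_max: Optional[float]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (lower, upper) bounds that are safe to use for pruning.

    None means "no usable bound on that side". Hypothetical helper.
    """
    lower, upper = stat_min, stat_max
    # Rules 1 and 2: a NaN min or max carries no information, so drop it.
    if lower is not None and math.isnan(lower):
        lower = None
    if upper is not None and math.isnan(upper):
        upper = None
    # Rule 3: a min of +0 may hide -0 values, so normalize a zero min to -0.0.
    if lower is not None and lower == 0.0:
        lower = -0.0
    # Rule 4: a max of -0 may hide +0 values, so normalize a zero max to +0.0.
    if upper is not None and upper == 0.0:
        upper = 0.0
    return lower, upper


def may_contain(
    value: float, stat_min: Optional[float], stat_max: Optional[float]
) -> bool:
    """Return False only if the statistics prove `value` is absent."""
    # Rule 5: min and max say nothing about NaN values.
    if math.isnan(value):
        return True
    lower, upper = usable_bounds(stat_min, stat_max)
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True
```

With the statistics shown for `nan_in_stats.parquet` (min 1, max NaN), such
a reader would keep only the lower bound. Python's own float comparison
already treats -0.0 and +0.0 as equal, so the zero normalization matters
mainly for readers that compare bit patterns or use a total ordering.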
