Fokko opened a new issue, #36068:
URL: https://github.com/apache/arrow/issues/36068

   ### Describe the enhancement requested
   
   Iceberg relies on statistics (called Metrics in Iceberg) to speed up 
queries. Most of the metrics are available and can be easily extracted using 
the `metadata_collector` argument, except for the NaN counts. When a query 
applies an `isNaN` expression to a FLOAT/DOUBLE field, Iceberg tries to skip 
Parquet files by looking at the metrics it has stored in the manifest files. 
It would be awesome if `nan_count` could be added next to `null_count`:
   
   ```python
   ➜  Desktop python3 
   Python 3.11.3 (main, Apr  7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import pyarrow as pa
   >>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
   ...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
   ...                              "Brittle stars", "Centipede"]})
   >>> metadata_collector = []
   >>> import pyarrow.parquet as pq
   >>> pq.write_to_dataset(
   ...     table, '/tmp/table',
   ...      metadata_collector=metadata_collector)
   >>> metadata_collector
   [<pyarrow._parquet.FileMetaData object at 0x11f955850>
     created_by: parquet-cpp-arrow version 11.0.0
     num_columns: 2
     num_rows: 6
     num_row_groups: 1
     format_version: 1.0
     serialized_size: 0]
   
   >>> metadata_collector[0].row_group(0)
   <pyarrow._parquet.RowGroupMetaData object at 0x105837d80>
     num_columns: 2
     num_rows: 6
     total_byte_size: 256
   
   >>> metadata_collector[0].row_group(0).to_dict()
   {
        'num_columns': 2,
        'num_rows': 6,
        'total_byte_size': 256,
        'columns': [{
                'file_offset': 119,
                'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
                'physical_type': 'INT64',
                'num_values': 6,
                'path_in_schema': 'n_legs',
                'is_stats_set': True,
                'statistics': {
                        'has_min_max': True,
                        'min': 2,
                        'max': 100,
                        'null_count': 0,
                        'distinct_count': 0,
                        'num_values': 6,
                        'physical_type': 'INT64'
                },
                'compression': 'SNAPPY',
                'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
                'has_dictionary_page': True,
                'dictionary_page_offset': 4,
                'data_page_offset': 46,
                'total_compressed_size': 115,
                'total_uncompressed_size': 117
        }, {
                'file_offset': 359,
                'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
                'physical_type': 'BYTE_ARRAY',
                'num_values': 6,
                'path_in_schema': 'animal',
                'is_stats_set': True,
                'statistics': {
                        'has_min_max': True,
                        'min': 'Brittle stars',
                        'max': 'Parrot',
                        'null_count': 0,
                        'distinct_count': 0,
                        'num_values': 6,
                        'physical_type': 'BYTE_ARRAY'
                },
                'compression': 'SNAPPY',
                'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
                'has_dictionary_page': True,
                'dictionary_page_offset': 215,
                'data_page_offset': 302,
                'total_compressed_size': 144,
                'total_uncompressed_size': 139
        }]
   }
   ```
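
To make the file-skipping idea concrete, here is a minimal sketch of how a reader might use such a statistic when evaluating an `isNaN` filter. This is not Iceberg's actual implementation: the function name `can_skip_for_is_nan` is hypothetical, and the `nan_count` key is the field this issue asks for; it does not exist in the `statistics` dict shown above.

```python
from typing import Optional


def can_skip_for_is_nan(stats: dict) -> bool:
    """Return True if a file can be skipped for an isNaN filter.

    `stats` mimics the per-column 'statistics' dict from to_dict().
    The 'nan_count' key is an assumption: it is the requested field.
    """
    nan_count: Optional[int] = stats.get("nan_count")
    if nan_count is None:
        # Statistic is missing, so nothing is known: the file must be read.
        return False
    # With zero NaN values, no row can match isNaN.
    return nan_count == 0


# Hypothetical statistics for three files.
print(can_skip_for_is_nan({"null_count": 0, "nan_count": 0}))
print(can_skip_for_is_nan({"null_count": 0, "nan_count": 3}))
print(can_skip_for_is_nan({"null_count": 0}))
```

The conservative default matters here: when the count is absent, the sketch refuses to skip, since skipping a file that does contain NaN values would silently drop matching rows.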
   
   Parquet itself is also looking into this: 
https://github.com/apache/parquet-format/pull/196
   
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
