Fokko commented on PR #7831:
URL: https://github.com/apache/iceberg/pull/7831#issuecomment-1589945848
Hey @maxdebayser, thanks for raising this PR. I had something different in
mind for the issue. The main problem here is that we pull a lot of data
through Python, which brings a number of issues with it (the binary
conversion, which is probably expensive, and limited scalability due to the
GIL).
I just took a quick stab at this (and should have done that sooner), and
found the following:
```python
➜ Desktop python3
Python 3.11.3 (main, Apr 7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
... 'animal': ["Flamingo", "Parrot", "Dog", "Horse",
... "Brittle stars", "Centipede"]})
>>> metadata_collector = []
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(
... table, '/tmp/table',
... metadata_collector=metadata_collector)
>>> metadata_collector
[<pyarrow._parquet.FileMetaData object at 0x11f955850>
created_by: parquet-cpp-arrow version 11.0.0
num_columns: 2
num_rows: 6
num_row_groups: 1
format_version: 1.0
serialized_size: 0]
>>> metadata_collector[0].row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x105837d80>
num_columns: 2
num_rows: 6
total_byte_size: 256
>>> metadata_collector[0].row_group(0).to_dict()
{
'num_columns': 2,
'num_rows': 6,
'total_byte_size': 256,
'columns': [{
'file_offset': 119,
'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
'physical_type': 'INT64',
'num_values': 6,
'path_in_schema': 'n_legs',
'is_stats_set': True,
'statistics': {
'has_min_max': True,
'min': 2,
'max': 100,
'null_count': 0,
'distinct_count': 0,
'num_values': 6,
'physical_type': 'INT64'
},
'compression': 'SNAPPY',
'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
'has_dictionary_page': True,
'dictionary_page_offset': 4,
'data_page_offset': 46,
'total_compressed_size': 115,
'total_uncompressed_size': 117
}, {
'file_offset': 359,
'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
'physical_type': 'BYTE_ARRAY',
'num_values': 6,
'path_in_schema': 'animal',
'is_stats_set': True,
'statistics': {
'has_min_max': True,
'min': 'Brittle stars',
'max': 'Parrot',
'null_count': 0,
'distinct_count': 0,
'num_values': 6,
'physical_type': 'BYTE_ARRAY'
},
'compression': 'SNAPPY',
'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
'has_dictionary_page': True,
'dictionary_page_offset': 215,
'data_page_offset': 302,
'total_compressed_size': 144,
'total_uncompressed_size': 139
}]
}
```
I think it is much better to retrieve the min/max values from there. The
statistics are computed by PyArrow (in C++) while writing the file, which is
probably much faster than computing them ourselves in Python. I really would
like to stay away from processing the data in Python as much as possible; a
sketch of how the collected metadata could be used follows below.
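As a minimal sketch (not code from this PR), the `FileMetaData` objects
gathered by the `metadata_collector` above could be folded into a single
min/max per column, keyed by `path_in_schema`:
```python
import pyarrow as pa
import pyarrow.parquet as pq

# Same toy table and metadata_collector as in the session above.
table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
                  'animal': ["Flamingo", "Parrot", "Dog", "Horse",
                             "Brittle stars", "Centipede"]})
metadata_collector = []
pq.write_to_dataset(table, '/tmp/table', metadata_collector=metadata_collector)

# Fold the per-row-group statistics into one min/max per column.
mins, maxs = {}, {}
for file_metadata in metadata_collector:
    for rg in range(file_metadata.num_row_groups):
        row_group = file_metadata.row_group(rg)
        for i in range(row_group.num_columns):
            column = row_group.column(i)
            stats = column.statistics
            if stats is None or not stats.has_min_max:
                continue  # statistics can be absent for a column chunk
            path = column.path_in_schema
            mins[path] = stats.min if path not in mins else min(mins[path], stats.min)
            maxs[path] = stats.max if path not in maxs else max(maxs[path], stats.max)

print(mins)  # {'n_legs': 2, 'animal': 'Brittle stars'}
print(maxs)  # {'n_legs': 100, 'animal': 'Parrot'}
```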
I think the complexity here is mapping the statistics back to the original
columns (and I think that gets tricky with nested columns); see the second
sketch below.
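For flat schemas, `path_in_schema` is simply the column name; for nested
fields it is a dot-separated path that has to be resolved step by step. A
hedged sketch against an illustrative Arrow schema (the `address` struct is
made up for this example, it is not from the PR):
```python
import pyarrow as pa

# Illustrative schema with one nested struct; 'address' is made up
# for this example and does not come from the PR.
schema = pa.schema([
    pa.field('n_legs', pa.int64()),
    pa.field('address', pa.struct([
        pa.field('city', pa.string()),
        pa.field('zip', pa.string()),
    ])),
])

def resolve_path(schema: pa.Schema, parquet_path: str) -> pa.Field:
    """Walk a dot-separated path_in_schema (e.g. 'address.city')
    down nested struct fields of the original Arrow schema."""
    first, *rest = parquet_path.split('.')
    field = schema.field(first)
    for name in rest:
        field = field.type.field(name)  # StructType.field() accepts a name
    return field

print(resolve_path(schema, 'address.city'))  # pyarrow.Field<city: string>
```
Lists and maps are where it gets really tricky, since their Parquet paths
contain synthetic nodes such as `list.element` that don't exist in the
logical schema.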
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]