Fokko commented on PR #7831:
URL: https://github.com/apache/iceberg/pull/7831#issuecomment-1589945848
Hey @maxdebayser, thanks for raising this PR. I had something different in
mind for the issue. The main problem here is that we pull a lot of data
through Python, which brings a number of issues with it (the binary
conversion, which is probably expensive, and limited scalability due to the
GIL).
I just took a quick stab at this (and should have done that sooner), and
found the following:
```python
➜ Desktop python3
Python 3.11.3 (main, Apr 7 2023, 20:13:31) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
... 'animal': ["Flamingo", "Parrot", "Dog", "Horse",
... "Brittle stars", "Centipede"]})
>>> metadata_collector = []
>>> import pyarrow.parquet as pq
>>> pq.write_to_dataset(
... table, '/tmp/table',
... metadata_collector=metadata_collector)
>>> metadata_collector
[<pyarrow._parquet.FileMetaData object at 0x11f955850>
created_by: parquet-cpp-arrow version 11.0.0
num_columns: 2
num_rows: 6
num_row_groups: 1
format_version: 1.0
serialized_size: 0]
>>> metadata_collector[0].row_group(0)
<pyarrow._parquet.RowGroupMetaData object at 0x105837d80>
num_columns: 2
num_rows: 6
total_byte_size: 256
>>> metadata_collector[0].row_group(0).to_dict()
{
'num_columns': 2,
'num_rows': 6,
'total_byte_size': 256,
'columns': [{
'file_offset': 119,
'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
'physical_type': 'INT64',
'num_values': 6,
'path_in_schema': 'n_legs',
'is_stats_set': True,
'statistics': {
'has_min_max': True,
'min': 2,
'max': 100,
'null_count': 0,
'distinct_count': 0,
'num_values': 6,
'physical_type': 'INT64'
},
'compression': 'SNAPPY',
'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
'has_dictionary_page': True,
'dictionary_page_offset': 4,
'data_page_offset': 46,
'total_compressed_size': 115,
'total_uncompressed_size': 117
}, {
'file_offset': 359,
'file_path': 'c569c5eaf90c4395885f31e012068b69-0.parquet',
'physical_type': 'BYTE_ARRAY',
'num_values': 6,
'path_in_schema': 'animal',
'is_stats_set': True,
'statistics': {
'has_min_max': True,
'min': 'Brittle stars',
'max': 'Parrot',
'null_count': 0,
'distinct_count': 0,
'num_values': 6,
'physical_type': 'BYTE_ARRAY'
},
'compression': 'SNAPPY',
'encodings': ('PLAIN_DICTIONARY', 'PLAIN', 'RLE'),
'has_dictionary_page': True,
'dictionary_page_offset': 215,
'data_page_offset': 302,
'total_compressed_size': 144,
'total_uncompressed_size': 139
}]
}
```
I think it is much better to retrieve the min/max values from there. The
statistics are computed by PyArrow (in C++) while writing the file, which is
probably much faster than computing them ourselves in Python. I really would
like to stay away from processing the data in Python as much as possible; a
sketch of how the collected metadata could be used follows below.
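As a minimal sketch (not code from this PR), the `FileMetaData` objects
gathered by the `metadata_collector` above could be folded into a single
min/max per column, keyed by `path_in_schema`:
```python
import pyarrow as pa
import pyarrow.parquet as pq

# Same toy table and metadata_collector as in the session above.
table = pa.table({'n_legs': [2, 2, 4, 4, 5, 100],
                  'animal': ["Flamingo", "Parrot", "Dog", "Horse",
                             "Brittle stars", "Centipede"]})
metadata_collector = []
pq.write_to_dataset(table, '/tmp/table', metadata_collector=metadata_collector)

# Fold the per-row-group statistics into one min/max per column.
mins, maxs = {}, {}
for file_metadata in metadata_collector:
    for rg in range(file_metadata.num_row_groups):
        row_group = file_metadata.row_group(rg)
        for i in range(row_group.num_columns):
            column = row_group.column(i)
            stats = column.statistics
            if stats is None or not stats.has_min_max:
                continue  # statistics can be absent for a column chunk
            path = column.path_in_schema
            mins[path] = stats.min if path not in mins else min(mins[path], stats.min)
            maxs[path] = stats.max if path not in maxs else max(maxs[path], stats.max)

print(mins)  # {'n_legs': 2, 'animal': 'Brittle stars'}
print(maxs)  # {'n_legs': 100, 'animal': 'Parrot'}
```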
I think the complexity here is mapping the statistics back to the original
columns (and I think that gets tricky with nested columns); see the second
sketch below.
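For flat schemas, `path_in_schema` is simply the column name; for nested
fields it is a dot-separated path that has to be resolved step by step. A
hedged sketch against an illustrative Arrow schema (the `address` struct is
made up for this example, it is not from the PR):
```python
import pyarrow as pa

# Illustrative schema with one nested struct; 'address' is made up
# for this example and does not come from the PR.
schema = pa.schema([
    pa.field('n_legs', pa.int64()),
    pa.field('address', pa.struct([
        pa.field('city', pa.string()),
        pa.field('zip', pa.string()),
    ])),
])

def resolve_path(schema: pa.Schema, parquet_path: str) -> pa.Field:
    """Walk a dot-separated path_in_schema (e.g. 'address.city')
    down nested struct fields of the original Arrow schema."""
    first, *rest = parquet_path.split('.')
    field = schema.field(first)
    for name in rest:
        field = field.type.field(name)  # StructType.field() accepts a name
    return field

print(resolve_path(schema, 'address.city'))  # pyarrow.Field<city: string>
```
Lists and maps are where it gets really tricky, since their Parquet paths
contain synthetic nodes such as `list.element` that don't exist in the
logical schema.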
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]