haziqishere opened a new pull request, #49926: URL: https://github.com/apache/arrow/pull/49926
### Rationale for this change

The `ColumnChunkMetaData.to_dict()` method omits `bloom_filter_offset` and `bloom_filter_length` even when a bloom filter is written to the Parquet file. Users cannot programmatically verify bloom filter presence via the Python metadata API without resorting to file size comparison.

### What changes are included in this PR?

1. `python/pyarrow/includes/libparquet.pxd`: Declare `bloom_filter_offset()` and `bloom_filter_length()` (both `optional[int64_t]`) on `CColumnChunkMetaData`. This exposes the existing C++ methods to Cython.
2. `python/pyarrow/_parquet.pyx`: Add `bloom_filter_offset` and `bloom_filter_length` properties to `ColumnChunkMetaData` (each returns an int when set, `None` otherwise). Add both fields to `to_dict()` and `__repr__`.
3. `python/pyarrow/tests/parquet/test_metadata.py`: Add `test_bloom_filter_offset_in_metadata`, which verifies that columns with a bloom filter expose non-None integer values and that `to_dict()` contains the keys, while columns without a bloom filter return `None`.

### Are these changes tested?

Yes.
`test_bloom_filter_offset_in_metadata` in `test_metadata.py` covers:

- Column with bloom filter: `bloom_filter_offset` and `bloom_filter_length` are non-None integers
- Column without bloom filter: both return `None`
- Both keys present in `to_dict()` output

<img width="863" height="215" alt="image" src="https://github.com/user-attachments/assets/d465d6bd-55d1-4c5f-9f11-6a60b3bf1cbe" />

Here is a closer look at the output:

<img width="464" height="424" alt="image" src="https://github.com/user-attachments/assets/6e4810f2-c1c0-41ea-b559-00f99d42e2c4" />

output:

```python
col_a bloom_filter_offset: 10699
col_a bloom_filter_length: 1040
col_b bloom_filter_offset: None
col_b bloom_filter_length: None
col_a to_dict(): {'file_offset': 0, 'file_path': '', 'physical_type': 'BYTE_ARRAY', 'num_values': 1000, 'path_in_schema': 'a', 'is_stats_set': True, 'statistics': {'has_min_max': True, 'min': 'id_0', 'max': 'id_999', 'null_count': 0, 'distinct_count': None, 'num_values': 1000, 'physical_type': 'BYTE_ARRAY'}, 'geo_statistics': None, 'compression': 'SNAPPY', 'encodings': ('PLAIN', 'RLE', 'RLE_DICTIONARY'), 'has_dictionary_page': True, 'dictionary_page_offset': 4, 'data_page_offset': 4035, 'total_compressed_size': 5336, 'total_uncompressed_size': 11208, 'bloom_filter_offset': 10699, 'bloom_filter_length': 1040}
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use
the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
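As a rough illustration of how a caller might use the two new `to_dict()` keys to check for a bloom filter, here is a minimal sketch that works on plain dictionaries shaped like the output above. It does not call pyarrow. The helper name `bloom_filter_location` is hypothetical, the sample dictionaries are trimmed stand-ins (the `col_b` dictionary is assumed from the `None` values shown above), and treating the offset as the presence signal is an assumption, not something the PR prescribes.

```python
from typing import Any, Mapping, Optional, Tuple


def bloom_filter_location(
    column_meta: Mapping[str, Any],
) -> Optional[Tuple[int, Optional[int]]]:
    """Return (offset, length) for a column chunk, or None if no bloom filter.

    `column_meta` is assumed to look like ColumnChunkMetaData.to_dict()
    output. Missing keys are treated the same as None values.
    """
    offset = column_meta.get("bloom_filter_offset")
    if offset is None:
        # Assumption: no offset means no bloom filter was written.
        return None
    # The length may be absent even when the offset is present.
    return offset, column_meta.get("bloom_filter_length")


# Trimmed stand-ins for the dictionaries discussed above (hypothetical data
# for col_b, which is only shown as property values in the output).
col_a = {"path_in_schema": "a", "bloom_filter_offset": 10699, "bloom_filter_length": 1040}
col_b = {"path_in_schema": "b", "bloom_filter_offset": None, "bloom_filter_length": None}

print(bloom_filter_location(col_a))  # (10699, 1040)
print(bloom_filter_location(col_b))  # None
```

Using `.get()` rather than indexing means the same check also returns `None` for dictionaries that lack the two keys entirely.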
