haziqishere opened a new pull request, #49926:
URL: https://github.com/apache/arrow/pull/49926

   ### Rationale for this change
   The `ColumnChunkMetaData.to_dict()` method omits `bloom_filter_offset` and 
`bloom_filter_length` even when a bloom filter is written to the Parquet file. 
As a result, users cannot programmatically verify bloom filter presence via the 
Python metadata API without resorting to file size comparison.
   
   ### What changes are included in this PR?
   
   1. `python/pyarrow/includes/libparquet.pxd`: Declare `bloom_filter_offset()` 
and `bloom_filter_length()` (both returning `optional[int64_t]`) on 
`CColumnChunkMetaData`, exposing the existing C++ methods to Cython.
   2. `python/pyarrow/_parquet.pyx`: Add `bloom_filter_offset` and 
`bloom_filter_length` properties to `ColumnChunkMetaData` (each returns an `int` 
when set, `None` otherwise). Add both fields to `to_dict()` and `__repr__`.
   3. `python/pyarrow/tests/parquet/test_metadata.py`: Add 
`test_bloom_filter_offset_in_metadata` verifying that columns with a bloom 
filter expose non-None integer values and that `to_dict()` contains the keys, 
while columns without a bloom filter return None.    
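
   The property semantics described in item 2 can be sketched in plain Python. 
This is only an illustrative stand-in, not the Cython code in this PR: 
`ColumnChunkStandIn` is a hypothetical class, the C++ `optional[int64_t]` values 
are modelled as `Optional[int]`, and it assumes `to_dict()` keeps both keys even 
when the values are `None`.

   ```python
from typing import Optional


class ColumnChunkStandIn:
    """Hypothetical pure-Python stand-in for the described behaviour."""

    def __init__(self, offset: Optional[int] = None,
                 length: Optional[int] = None):
        self._offset = offset
        self._length = length

    @property
    def bloom_filter_offset(self) -> Optional[int]:
        # int when a bloom filter was recorded, None otherwise
        return self._offset

    @property
    def bloom_filter_length(self) -> Optional[int]:
        # int when a bloom filter length was recorded, None otherwise
        return self._length

    def to_dict(self) -> dict:
        # Assumption: both keys are emitted even when the values are None.
        return {
            "bloom_filter_offset": self.bloom_filter_offset,
            "bloom_filter_length": self.bloom_filter_length,
        }


with_filter = ColumnChunkStandIn(offset=10699, length=1040)
without_filter = ColumnChunkStandIn()
print(with_filter.to_dict())
print(without_filter.bloom_filter_offset)
   ```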
   
   ### Are these changes tested?
   Yes. `test_bloom_filter_offset_in_metadata` in test_metadata.py covers:     
   
   - Column with bloom filter: bloom_filter_offset and bloom_filter_length are 
non-None integers
   - Column without bloom filter: both return None
   - Both keys present in to_dict() output
   
   <img width="863" height="215" alt="image" 
src="https://github.com/user-attachments/assets/d465d6bd-55d1-4c5f-9f11-6a60b3bf1cbe"
 />
   
   Here is a closer look at the output:
   
   <img width="464" height="424" alt="image" 
src="https://github.com/user-attachments/assets/6e4810f2-c1c0-41ea-b559-00f99d42e2c4"
 />
   
   Output:
   ```python
   col_a bloom_filter_offset: 10699
   col_a bloom_filter_length: 1040
   col_b bloom_filter_offset: None
   col_b bloom_filter_length: None
   
   col_a to_dict(): {'file_offset': 0, 'file_path': '', 'physical_type': 
'BYTE_ARRAY', 'num_values': 1000, 'path_in_schema': 'a', 'is_stats_set': True, 
'statistics': {'has_min_max': True, 'min': 'id_0', 'max': 'id_999', 
'null_count': 0, 'distinct_count': None, 'num_values': 1000, 'physical_type': 
'BYTE_ARRAY'}, 'geo_statistics': None, 'compression': 'SNAPPY', 'encodings': 
('PLAIN', 'RLE', 'RLE_DICTIONARY'), 'has_dictionary_page': True, 
'dictionary_page_offset': 4, 'data_page_offset': 4035, 'total_compressed_size': 
5336, 'total_uncompressed_size': 11208, 'bloom_filter_offset': 10699, 
'bloom_filter_length': 1040}
   ```
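
   With both keys present in `to_dict()`, bloom filter presence can be checked 
directly from the dictionary instead of comparing file sizes. A minimal sketch, 
assuming the dictionary shape shown in the output above: `has_bloom_filter` is 
a hypothetical helper (not part of pyarrow), `col_a` is abridged from that 
output, and `col_b` is an assumed dictionary built from the `None` values 
printed for the column without a bloom filter.

   ```python
def has_bloom_filter(column_dict):
    """Return True when a column chunk dict reports a bloom filter offset.

    Hypothetical helper; it treats a missing key the same as None.
    """
    return column_dict.get("bloom_filter_offset") is not None


# Abridged from the col_a to_dict() output above.
col_a = {
    "path_in_schema": "a",
    "bloom_filter_offset": 10699,
    "bloom_filter_length": 1040,
}

# Assumed shape for the column without a bloom filter.
col_b = {
    "path_in_schema": "b",
    "bloom_filter_offset": None,
    "bloom_filter_length": None,
}

for col in (col_a, col_b):
    print(col["path_in_schema"], has_bloom_filter(col))
   ```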
   
   
   

