haziqishere opened a new issue, #49927:
URL: https://github.com/apache/arrow/issues/49927
### Describe the bug, including details regarding any error messages, version, and platform.
**Describe the bug**
When writing a Parquet file with `bloom_filter_options` via
`pyarrow.parquet.write_table`, the bloom filter is correctly written to disk
(verified via file size delta matching SBBF sizing formula), but the bloom
filter offset and length are not reflected in `ColumnChunkMetaData.to_dict()`.
The keys `bloom_filter_offset` and `bloom_filter_length` are absent from the
returned dict, making it impossible to programmatically verify bloom filter
presence via the Python metadata API.
**pyarrow version:** 24.0.0
---
I've proposed a fix for this issue in #49926.
**To Reproduce Error**
```python
import os
import pyarrow as pa
import pyarrow.parquet as pq
table = pa.table({
    "event_id": [f"id_{i:06d}" for i in range(10_000)],
    "value": [f"data_{i}" for i in range(10_000)],
})
# Write without bloom filter
pq.write_table(table, "/tmp/no_bloom.parquet")
# Write with bloom filter
pq.write_table(
    table,
    "/tmp/with_bloom.parquet",
    bloom_filter_options={"event_id": {"fpp": 0.05, "ndv": 10_000}},
)
# File size confirms bloom filter is written
size_no = os.path.getsize("/tmp/no_bloom.parquet")
size_with = os.path.getsize("/tmp/with_bloom.parquet")
print(f"Without bloom: {size_no:,} bytes")
print(f"With bloom: {size_with:,} bytes")
print(f"Difference: {size_with - size_no:,} bytes") # ~16KB confirming
bloom written
# But metadata does not reflect it
pf = pq.ParquetFile("/tmp/with_bloom.parquet")
col = pf.metadata.row_group(0).column(0) # event_id column
print(col.to_dict())
# Expected: dict contains 'bloom_filter_offset' and 'bloom_filter_length'
# Actual: neither key is present
```
**Expected output**
```
Without bloom: 135,278 bytes
With bloom: 151,687 bytes
Difference: +16,409 bytes
{
    'path_in_schema': 'event_id',
    ...
    'bloom_filter_offset': <some_int>,
    'bloom_filter_length': <some_int>,  # if exposed
    ...
}
```
**Actual output**
```
Without bloom: 135,278 bytes
With bloom: 151,687 bytes
Difference: +16,409 bytes
{
    'file_offset': 0,
    'file_path': '',
    'physical_type': 'BYTE_ARRAY',
    'num_values': 10000,
    'path_in_schema': 'event_id',
    'is_stats_set': True,
    'statistics': {...},
    'compression': 'SNAPPY',
    'encodings': ('PLAIN', 'RLE', 'RLE_DICTIONARY'),
    'has_dictionary_page': True,
    'dictionary_page_offset': 4,
    'data_page_offset': 49,
    'total_compressed_size': 96,
    'total_uncompressed_size': 93
    # bloom_filter_offset is absent
}
```
---
**Additional context**
The `bloom_filter_offset` field exists in the underlying Parquet Thrift `ColumnMetaData` spec and is populated when a bloom filter is written. The file size delta between the two files matches the expected SBBF sizing for `ndv=10_000, fpp=0.05`, confirming that the bloom filter data is physically present in the file.
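For reference, a minimal sketch of that sizing check, assuming the split-block bloom filter formula from the Parquet spec (`bits = -8 * ndv / ln(1 - fpp^(1/8))`); the power-of-two rounding of the bitset and the small (~25–27 byte) remainder being the serialized Thrift `BloomFilterHeader` are my inferences from the observed deltas, not documented guarantees:

```python
import math

def sbbf_bitset_bytes(ndv: int, fpp: float) -> int:
    # Optimal bit count for a split-block bloom filter (Parquet spec formula).
    bits = -8.0 * ndv / math.log(1.0 - fpp ** (1.0 / 8.0))
    # Assumption: the writer rounds the bitset up to the next power of two.
    return 1 << math.ceil(math.log2(bits / 8.0))

print(sbbf_bitset_bytes(10_000, 0.05))   # 16384 -> +16,409 observed (~25-byte header)
print(sbbf_bitset_bytes(10_000, 0.001))  # 32768 -> +32,793 observed
```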
The `ColumnChunkMetaData` C++ class already exposes `bloom_filter_offset()`; it would be useful to surface this in the Python `to_dict()` output so users can programmatically verify bloom filter presence without resorting to file size comparison.
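For illustration, the kind of check this would enable; the key names below are assumed to mirror the C++ accessors and are exactly what is missing in 24.0.0:

```python
# Hypothetical usage once the keys are surfaced in to_dict();
# key names assumed to mirror bloom_filter_offset()/bloom_filter_length().
meta = pf.metadata.row_group(0).column(0).to_dict()
if "bloom_filter_offset" in meta:
    print("bloom filter at offset", meta["bloom_filter_offset"],
          "length", meta.get("bloom_filter_length"))
else:
    print("no bloom filter recorded in metadata")
```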
Tested variants of `bloom_filter_options` and their observed file size deltas, confirming a bloom filter is written in every case (a sketch of the sweep follows the table):
```
bool_true    → ndv=1,048,576 (default)  → +1,048,603 bytes
empty_dict   → ndv=1,048,576 (default)  → +1,048,603 bytes
fpp_only     → ndv=1,048,576 (default)  → +1,048,603 bytes
ndv_only     → ndv=10,000               → +16,409 bytes
fpp_and_ndv  → ndv=10,000               → +16,409 bytes
tight_fpp    → fpp=0.001, ndv=10,000    → +32,793 bytes
both_cols    → 2 columns, default ndv   → +2,097,207 bytes (~2×)
```
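A sketch of that sweep, reusing `table` and `/tmp/no_bloom.parquet` from the reproduction above; the exact option spelling for each label is my reconstruction from the row names (e.g. that a bare `True` enables all defaults), not verbatim from my test script:

```python
import os
import pyarrow.parquet as pq

# Option values per label are assumptions reconstructed from the table.
# "both_cols" additionally passed options for both "event_id" and "value".
variants = {
    "bool_true": True,
    "empty_dict": {},
    "fpp_only": {"fpp": 0.05},
    "ndv_only": {"ndv": 10_000},
    "fpp_and_ndv": {"fpp": 0.05, "ndv": 10_000},
    "tight_fpp": {"fpp": 0.001, "ndv": 10_000},
}
base = os.path.getsize("/tmp/no_bloom.parquet")
for name, opts in variants.items():
    path = f"/tmp/{name}.parquet"
    pq.write_table(table, path, bloom_filter_options={"event_id": opts})
    print(f"{name:12s} -> +{os.path.getsize(path) - base:,} bytes")
```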
Related: #49376 (added `bloom_filter_options` to `write_table` in 24.0.0)
### Component(s)
C++