wjones127 commented on issue #15042:
URL: https://github.com/apache/arrow/issues/15042#issuecomment-1368108369
This is very odd. It might have more to do with dictionary unification than with Parquet itself, but I'm not 100% sure. If we create the two chunks as slices of the same original dictionary array, it works fine:
```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema({"field_1": pa.dictionary(pa.int32(), pa.string())})

# Both chunks are slices of one array, so they share a single dictionary.
arr = pa.array(["rusty", "sean", "aa", "zzz", "frank"]).dictionary_encode()
arr_1 = arr.slice(0, 3)
arr_2 = arr.slice(3, 2)  # slice() takes (offset, length), not (start, stop)

t = pa.Table.from_batches(
    [
        pa.record_batch([arr_1], names=["field_1"]),
        pa.record_batch([arr_2], names=["field_1"]),
    ]
)

with pq.ParquetWriter("example.parquet", schema) as writer:
    writer.write_table(t)

metadata = pq.ParquetFile("example.parquet").metadata
print(f"Has {metadata.num_row_groups} row groups")
stats = metadata.row_group(0).column(0).statistics
print(stats)
```
outputs
```
Has 1 row groups
<pyarrow._parquet.Statistics object at 0x11f76fb50>
has_min_max: True
min: aa
max: zzz
null_count: 0
distinct_count: 0
num_values: 5
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
```
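For contrast, here is a minimal sketch of what I assume the failing setup looks like (based on the issue description, not a confirmed repro): each chunk is dictionary-encoded independently, so the chunks have distinct dictionaries, and we then unify them with `Table.unify_dictionaries()` before writing. If the statistics come out correct this way, that would point at the unification path rather than the Parquet writer:
```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema({"field_1": pa.dictionary(pa.int32(), pa.string())})

# Unlike the slices above, each chunk here gets its own dictionary.
arr_1 = pa.array(["rusty", "sean", "aa"]).dictionary_encode()
arr_2 = pa.array(["zzz", "frank"]).dictionary_encode()

t = pa.Table.from_batches(
    [
        pa.record_batch([arr_1], names=["field_1"]),
        pa.record_batch([arr_2], names=["field_1"]),
    ]
)

# Rewrite all chunks against one shared dictionary before handing the
# table to the Parquet writer.
t_unified = t.unify_dictionaries()

with pq.ParquetWriter("example_unified.parquet", schema) as writer:
    writer.write_table(t_unified)

stats = (
    pq.ParquetFile("example_unified.parquet")
    .metadata.row_group(0)
    .column(0)
    .statistics
)
print(stats)
```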