wjones127 commented on issue #15042:
URL: https://github.com/apache/arrow/issues/15042#issuecomment-1368108369
This is very odd. It might have more to do with dictionary unification than with Parquet itself, but I'm not 100% sure. If we create the two chunks as slices of the same original dictionary array, it works fine:
```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema({"field_1": pa.dictionary(pa.int32(), pa.string())})

# Both chunks are slices of one array, so they share a single dictionary.
arr = pa.array(["rusty", "sean", "aa", "zzz", "frank"]).dictionary_encode()
arr_1 = arr.slice(0, 3)
arr_2 = arr.slice(3, 2)  # slice() takes (offset, length), not (start, stop)

t = pa.Table.from_batches(
    [
        pa.record_batch([arr_1], names=["field_1"]),
        pa.record_batch([arr_2], names=["field_1"]),
    ]
)

with pq.ParquetWriter("example.parquet", schema) as writer:
    writer.write_table(t)

metadata = pq.ParquetFile("example.parquet").metadata
print(f"Has {metadata.num_row_groups} row groups")
stats = metadata.row_group(0).column(0).statistics
print(stats)
```
outputs
```
Has 1 row groups
<pyarrow._parquet.Statistics object at 0x11f76fb50>
has_min_max: True
min: aa
max: zzz
null_count: 0
distinct_count: 0
num_values: 5
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
```
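For contrast, here is a minimal sketch of what I assume the failing setup looks like (based on the issue description, not a confirmed repro): each chunk is dictionary-encoded independently, so the chunks have distinct dictionaries, and we then unify them with `Table.unify_dictionaries()` before writing. If the statistics come out correct this way, that would point at the unification path rather than the Parquet writer:
```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema({"field_1": pa.dictionary(pa.int32(), pa.string())})

# Unlike the slices above, each chunk here gets its own dictionary.
arr_1 = pa.array(["rusty", "sean", "aa"]).dictionary_encode()
arr_2 = pa.array(["zzz", "frank"]).dictionary_encode()

t = pa.Table.from_batches(
    [
        pa.record_batch([arr_1], names=["field_1"]),
        pa.record_batch([arr_2], names=["field_1"]),
    ]
)

# Rewrite all chunks against one shared dictionary before handing the
# table to the Parquet writer.
t_unified = t.unify_dictionaries()

with pq.ParquetWriter("example_unified.parquet", schema) as writer:
    writer.write_table(t_unified)

stats = (
    pq.ParquetFile("example_unified.parquet")
    .metadata.row_group(0)
    .column(0)
    .statistics
)
print(stats)
```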