JigaoLuo opened a new issue, #47955:
URL: https://github.com/apache/arrow/issues/47955

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I encountered a bug while trying to retrieve Parquet metadata for a column 
chunk with logical type `Decimal128(15, 2)`.
   - The Parquet file was generated using arrow-rs, and I can successfully 
access its metadata via `arrow-rs`, `DataFusion`, or this tool: 
https://parquet-viewer.xiangpeng.systems/
   - **However, I run into an error when attempting to read the metadata using 
PyArrow.**
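
For context on the column type: the output further down reports `INT64` as the physical type of `c_acctbal`, presumably the decimal column. The Parquet format allows a DECIMAL logical type to annotate an integer physical type, in which case the stored number is the unscaled integer. Below is a minimal sketch of that mapping, assuming a scale of 2 as in `Decimal128(15, 2)`; the helper name `decode_int_decimal` is made up for illustration and is not part of PyArrow.

```python
from decimal import Decimal


def decode_int_decimal(unscaled: int, scale: int) -> Decimal:
    """Turn an unscaled integer into a decimal value with `scale` fractional digits.

    Hypothetical helper for illustration only, not a PyArrow API.
    """
    # scaleb shifts the decimal exponent, so no floating point is involved.
    return Decimal(unscaled).scaleb(-scale)


if __name__ == "__main__":
    # An INT64 value of 12345 with scale 2 represents 123.45.
    print(decode_int_decimal(12345, 2))
```

A reader that handles this column's statistics would be expected to apply the same kind of conversion to the raw min/max values.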
   
   I’ll attach the Parquet file (under 50MB) along with a minimal Python script 
to reproduce the issue. If the bug isn’t reproducible on your end, I’m happy to 
help investigate further.
   
   ```python
   #!/usr/bin/env python3
   # $ python parquet_metadata_reader.py supplier.parquet 
   
   import sys
   import pyarrow.parquet as pq
   
   def print_parquet_metadata(parquet_file):
       pq_metadata = pq.read_metadata(parquet_file)
       schema = pq_metadata.schema.to_arrow_schema()
       for col_idx in range(len(schema)):
           field = schema.field(col_idx)
           col_name = field.name
           column_meta = pq_metadata.schema.column(col_idx)
           print(f"Column {col_idx}: {col_name}")
           print(f"  Type: {column_meta.physical_type}")
           row_group = pq_metadata.row_group(0) # Stats of the first row group
           rg_column = row_group.column(col_idx)
           print("  Stats:", rg_column.statistics)
   
   if __name__ == "__main__":
       if len(sys.argv) != 2:
           print("Usage: python parquet_metadata_reader.py <parquet_file>")
           sys.exit(1)
       try:
           print_parquet_metadata(sys.argv[1])
       except Exception as e:
           print(f"Error: {e}")
           sys.exit(1)
   ```
   
   The error message:
   ```
   Column 0: c_custkey
     Type: INT64
     Stats: <pyarrow._parquet.Statistics object at 0x7f11d4accd60>
     has_min_max: True
     min: 1
     max: 14999999
     null_count: 0
     distinct_count: None
     num_values: 3000188
     physical_type: INT64
     logical_type: None
     converted_type (legacy): NONE
   Column 1: c_nationkey
     Type: INT32
     Stats: <pyarrow._parquet.Statistics object at 0x7f11d4accd10>
     has_min_max: True
     min: 0
     max: 24
     null_count: 0
     distinct_count: None
     num_values: 3000188
     physical_type: INT32
     logical_type: None
     converted_type (legacy): NONE
   Column 2: c_acctbal
     Type: INT64
     Stats: Error: Cannot extract statistics for type 
   ```
   
   Thanks!
   
   
   ## Version
   
   I installed `pyarrow` via conda: 
   
   ```bash
   $ conda list | grep pyarrow
   pyarrow                             21.0.0              py313h78bf25f_1        conda-forge
   pyarrow-core                        21.0.0              py313he109ebe_1_cpu    conda-forge
   ```
   
   ## Platform
   
   I am running on bare metal with an `AMD EPYC 7742 64-Core Processor` CPU and NVIDIA's Ubuntu kernel, `5.15.0-1042-nvidia`:
   
   ```bash
   $ uname -a
   Linux dgx 5.15.0-1042-nvidia #42-Ubuntu SMP Wed Nov 15 20:28:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
   ```
   
   ## Related issue (?)
   
   The only similar report I could find is not exactly the same issue: 
https://github.com/microsoft/semantic-link-labs/issues/909
   
   ### Component(s)
   
   Parquet, Python


