GabrielAmazonas opened a new pull request, #2995:
URL: https://github.com/apache/iceberg-python/pull/2995

   Problem:
   When using `add_files()` with Parquet files written by DuckDB, PyIceberg 
fails with `AttributeError: 'bytes' object has no attribute 'encode'`.
   
   Root Cause:
   The Parquet format stores column statistics (min_value, max_value) as binary 
data in the Statistics struct (see parquet.thrift). When PyArrow reads these 
statistics from Parquet files, it may return them as Python `bytes` objects 
rather than decoded `str` values. This is valid per the Parquet specification:
   
     struct Statistics {
       5: optional binary max_value;
       6: optional binary min_value;
     }
   
   PyIceberg's StatsAggregator expected string statistics to always be `str`, 
causing failures when processing Parquet files from writers like DuckDB that 
expose this binary representation.
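
The failure mode can be reproduced without PyIceberg or PyArrow. The following is a minimal sketch, not PyIceberg's actual code: `serialize_string_stat` is a hypothetical stand-in for any code path that assumes a string statistic is always a `str`.

```python
def serialize_string_stat(value):
    # Hypothetical stand-in: assumes the statistic is always a `str`.
    return value.encode("utf-8")


# A decoded `str` statistic serializes without trouble.
serialized = serialize_string_stat("apple")

# A raw `bytes` statistic has no `.encode()` method, so the call fails.
try:
    serialize_string_stat(b"apple")
except AttributeError as exc:
    error_message = str(exc)
```

Here `error_message` carries the same `'bytes' object has no attribute 'encode'` text reported in the problem statement.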
   
   Fix:
   1. In `StatsAggregator.min_as_bytes()`: Handle `bytes` values by decoding 
them as UTF-8 into `str` before truncation and serialization.
   
   2. In `StatsAggregator.max_as_bytes()`: Update the existing string handling 
to decode `bytes` values before processing (this path previously raised 
`ValueError`).
   
   3. In `to_bytes()` for `StringType`: Add a defensive `isinstance` check so 
that `bytes` values are handled as a fallback.
   
   4. Add unit tests for both StatsAggregator bytes handling and to_bytes.
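
The normalization described in the steps above might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this PR: `normalize_string_stat` and `min_string_as_bytes` are hypothetical names, and the truncation shown is a plain character-prefix cut suitable only for a lower bound (an upper bound needs extra care, which is omitted here).

```python
from typing import Optional, Union


def normalize_string_stat(value: Union[str, bytes]) -> str:
    # Hypothetical helper: accept either representation and return `str`.
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value


def min_string_as_bytes(
    value: Union[str, bytes], truncate_length: Optional[int] = None
) -> bytes:
    # Decode first, so truncation counts characters rather than raw bytes.
    text = normalize_string_stat(value)
    if truncate_length is not None:
        text = text[:truncate_length]
    # Serialize the (possibly truncated) lower bound as UTF-8.
    return text.encode("utf-8")


from_bytes = min_string_as_bytes(b"duckdb", truncate_length=4)
from_str = min_string_as_bytes("duckdb", truncate_length=4)
```

Decoding before truncating matters for multi-byte characters: cutting the raw bytes could split a UTF-8 sequence, whereas cutting the decoded string always yields valid text.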
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

