The GitHub Actions job "Python CI Docs" on 
iceberg-python.git/feat/add-files-duckdb has failed.
Run started by GitHub user GabrielAmazonas (triggered by GabrielAmazonas).

Head commit for run:
dd2e8a5320c9039d7abffc1be899e12dadd47b72 / Gabriel Amazonas 
<[email protected]>
Fix: Handle bytes values in string column statistics from Parquet

Problem:
When using `add_files()` with Parquet files written by DuckDB, PyIceberg
fails with `AttributeError: 'bytes' object has no attribute 'encode'`.
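The failure mode can be illustrated in isolation; a minimal, hypothetical sketch (not PyIceberg's actual code) of what happens when a string-column statistic arrives as `bytes` and code assuming `str` calls `.encode()`:

```python
# Hypothetical reproduction: a string-column statistic surfaces as bytes
# (as it may from a DuckDB-written Parquet file), but the consumer assumes str.
stat_value = b"duckdb-min"

try:
    stat_value.encode("utf-8")  # valid on str; bytes has no .encode()
except AttributeError as exc:
    print(exc)  # 'bytes' object has no attribute 'encode'
```

In Python 3, `bytes` objects expose `.decode()` but not `.encode()`, so the mix-up surfaces as this AttributeError rather than a quieter type error.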

Root Cause:
The Parquet format stores column statistics (min_value, max_value) as binary
data in the Statistics struct (see parquet.thrift). When PyArrow reads these
statistics from Parquet files, it may return them as Python `bytes` objects
rather than decoded `str` values. This is valid per the Parquet specification:

  struct Statistics {
    5: optional binary max_value;
    6: optional binary min_value;
  }

PyIceberg's StatsAggregator expected string statistics to always be `str`,
causing failures when processing Parquet files from writers like DuckDB that
expose this binary representation.

Fix:
1. In `StatsAggregator.min_as_bytes()`: Add handling for bytes values by
   decoding to UTF-8 string before truncation and serialization.

2. In `StatsAggregator.max_as_bytes()`: Update the existing string handling to
   decode bytes values before processing (this path previously raised ValueError).

3. In `to_bytes()` for StringType: Add defensive isinstance check to handle
   bytes values as a safety fallback.

4. Add unit tests for both StatsAggregator bytes handling and to_bytes.

Report URL: https://github.com/apache/iceberg-python/actions/runs/21562438238

With regards,
GitHub Actions via GitBox
