The GitHub Actions job "Python CI Docs" on iceberg-python.git/feat/add-files-duckdb has failed. Run started by GitHub user GabrielAmazonas (triggered by GabrielAmazonas).
Head commit for run: dd2e8a5320c9039d7abffc1be899e12dadd47b72 / Gabriel Amazonas <[email protected]>

Fix: Handle bytes values in string column statistics from Parquet

Problem: When using `add_files()` with Parquet files written by DuckDB, PyIceberg fails with `AttributeError: 'bytes' object has no attribute 'encode'`.

Root Cause: The Parquet format stores column statistics (min_value, max_value) as binary data in the Statistics struct (see parquet.thrift). When PyArrow reads these statistics from Parquet files, it may return them as Python `bytes` objects rather than decoded `str` values. This is valid per the Parquet specification:

    struct Statistics {
      5: optional binary max_value;
      6: optional binary min_value;
    }

PyIceberg's StatsAggregator expected string statistics to always be `str`, causing failures when processing Parquet files from writers like DuckDB that expose this binary representation.

Fix:
1. In `StatsAggregator.min_as_bytes()`: add handling for bytes values by decoding to a UTF-8 string before truncation and serialization.
2. In `StatsAggregator.max_as_bytes()`: update the existing string handling to decode bytes values before processing (previously raised ValueError).
3. In `to_bytes()` for StringType: add a defensive isinstance check to handle bytes values as a safety fallback.
4. Add unit tests covering both the StatsAggregator bytes handling and `to_bytes()`.

Report URL: https://github.com/apache/iceberg-python/actions/runs/21562438238

With regards,
GitHub Actions via GitBox
