GabrielAmazonas opened a new pull request, #2995:
URL: https://github.com/apache/iceberg-python/pull/2995
Problem:
When using `add_files()` with Parquet files written by DuckDB, PyIceberg
fails with `AttributeError: 'bytes' object has no attribute 'encode'`.
Root Cause:
The Parquet format stores column statistics (min_value, max_value) as binary
data in the Statistics struct (see parquet.thrift). When PyArrow reads these
statistics from Parquet files, it may return them as Python `bytes` objects
rather than decoded `str` values. This is valid per the Parquet specification:
```thrift
struct Statistics {
  5: optional binary max_value;
  6: optional binary min_value;
}
```
PyIceberg's `StatsAggregator` expected string statistics to always be `str`,
causing failures when processing Parquet files from writers like DuckDB that
expose this binary representation.
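The failure mode described above can be reproduced in isolation. This is a minimal sketch, not PyIceberg code; the `ensure_str` helper is a hypothetical name for the decode-before-use pattern, assuming the bytes are valid UTF-8:

```python
# A string-column statistic may surface as raw bytes rather than str.
stat_value = b"duckdb"  # e.g. min_value read back from a Parquet footer

# Code that assumes str fails on bytes, matching the reported error:
try:
    stat_value.encode("utf-8")  # bytes objects have no .encode
except AttributeError as exc:
    print(exc)  # 'bytes' object has no attribute 'encode'

# Defensive normalization (hypothetical helper, assumes UTF-8):
def ensure_str(value):
    return value.decode("utf-8") if isinstance(value, bytes) else value

assert ensure_str(stat_value) == "duckdb"
assert ensure_str("duckdb") == "duckdb"
```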
Fix:
1. In `StatsAggregator.min_as_bytes()`: Add handling for bytes values by
decoding to UTF-8 string before truncation and serialization.
2. In `StatsAggregator.max_as_bytes()`: Update existing string handling to
decode bytes values before processing (was raising ValueError).
3. In `to_bytes()` for StringType: Add defensive isinstance check to handle
bytes values as a safety fallback.
4. Add unit tests for both StatsAggregator bytes handling and to_bytes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]