GayathriSrividya opened a new pull request, #3444: URL: https://github.com/apache/iceberg-python/pull/3444
<!-- Closes #3260 --> ## Rationale Iceberg treats Arrow dictionary encoding as an encoding detail rather than a separate logical type. However, `ArrowScan.to_table` currently concatenates batches without decoding dictionary-encoded columns first. A table containing both plain strings and dictionary-encoded strings therefore fails to scan with `ArrowTypeError: Unable to merge`. This can occur in production when files written with dictionary encoding are later rewritten by Athena or Trino optimization into plain strings. ## Changes - Recursively unwrap Arrow dictionary types while preserving unrelated Arrow types and schema metadata. - Normalize dictionary-encoded batches before permissive concatenation in `ArrowScan.to_table`. - Add regression coverage for mixed plain/dictionary string batches and nested dictionary types. ## Attribution I checked the issue and PR history before opening this PR. I did not find an earlier PR or implementation to cherry-pick for #3260. ## Verification - `make lint` - `make test` (`3711 passed, 1534 deselected`) - `uv run python -m pytest tests/io/test_pyarrow.py -q -k "mixed_dictionary or ensure_non_dictionary"` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
