[PR] fix: normalize dictionary types in Arrow scans [iceberg-python]

via GitHub Sat, 30 May 2026 06:39:01 -0700


GayathriSrividya opened a new pull request, #3444:
URL: https://github.com/apache/iceberg-python/pull/3444


   <!-- Closes #3260 -->
   
   ## Rationale
   Iceberg treats Arrow dictionary encoding as an encoding detail rather than a 
separate logical type. However, `ArrowScan.to_table` currently concatenates 
batches without decoding dictionary-encoded columns first. A table containing 
both plain strings and dictionary-encoded strings therefore fails to scan with 
`ArrowTypeError: Unable to merge`.
   
   This can occur in production when files written with dictionary encoding are 
later rewritten as plain strings by Athena or Trino optimization, leaving both 
representations in the same table.
   
   ## Changes
   - Recursively unwrap Arrow dictionary types while preserving unrelated Arrow 
types and schema metadata.
   - Normalize dictionary-encoded batches before permissive concatenation in 
`ArrowScan.to_table`.
   - Add regression coverage for mixed plain/dictionary string batches and 
nested dictionary types.
   
   ## Attribution
   I checked the issue and PR history before opening this PR. I did not find an 
earlier PR or implementation to cherry-pick for #3260.
   
   ## Verification
   - `make lint`
   - `make test` (`3711 passed, 1534 deselected`)
   - `uv run python -m pytest tests/io/test_pyarrow.py -q -k "mixed_dictionary or ensure_non_dictionary"`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

