GayathriSrividya opened a new pull request, #3461:
URL: https://github.com/apache/iceberg-python/pull/3461
Closes #3170
## Rationale
Columns that contain large or frequently repeated string values (e.g. JSON
blobs, low-cardinality categoricals) can exhaust memory when PyArrow loads them
as plain string arrays. PyArrow's dataset Parquet reader
(`pyarrow.dataset.ParquetFileFormat`) supports dictionary-encoded reads via its
`dictionary_columns` kwarg, which stores each distinct value once and can
substantially reduce peak memory usage when values repeat.
This was previously discussed in #3168, and a prior implementation (#3234)
was closed as stale.
## Changes
- Added `dictionary_columns: tuple[str, ...] = ()` to `Table.scan()`,
`TableScan.__init__`, and `StagedTable.scan()`.
- Forwarded the option through `DataScan.to_arrow()` and
`to_arrow_batch_reader()` → `ArrowScan.__init__` → `_task_to_record_batches` →
`_get_file_format()`.
- Applied only when `task.file.file_format == FileFormat.PARQUET`; silently
ignored for ORC, whose reader does not support this kwarg.
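The format gating described in the last bullet can be sketched roughly as follows. This is not the code in the PR: `reader_kwargs` is a hypothetical helper, and the `FileFormat` enum here is a local stand-in for PyIceberg's own, so that the snippet runs without any dependencies.

```python
from enum import Enum
from typing import Any


class FileFormat(Enum):
    """Local stand-in for PyIceberg's FileFormat, for illustration only."""

    PARQUET = "PARQUET"
    ORC = "ORC"


def reader_kwargs(
    file_format: FileFormat, dictionary_columns: tuple[str, ...] = ()
) -> dict[str, Any]:
    """Hypothetical helper: extra kwargs to pass to the file-format reader.

    Only Parquet receives `dictionary_columns`; every other format gets
    no extra kwargs, so the option is silently ignored there.
    """
    if file_format == FileFormat.PARQUET and dictionary_columns:
        return {"dictionary_columns": list(dictionary_columns)}
    return {}


print(reader_kwargs(FileFormat.PARQUET, ("payload",)))
print(reader_kwargs(FileFormat.ORC, ("payload",)))
```

Keeping the default as an empty tuple means existing callers that never pass `dictionary_columns` see no change in how files are read.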
## Usage
```python
# Read the "payload" column as dictionary-encoded to save memory
df = table.scan(dictionary_columns=("payload",)).to_arrow()
```
## Verification
- Added `test_dictionary_columns_produces_dict_encoded_output` — confirms
the requested column is dict-encoded, non-requested columns are plain, and
values are identical.
- `make lint` ✓
- `pytest tests/table/ tests/io/test_pyarrow.py` ✓
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]