[PR] feat: add dictionary_columns parameter to Table.scan() for memory-efficient reads [iceberg-python]

via GitHub Fri, 05 Jun 2026 00:12:51 -0700


GayathriSrividya opened a new pull request, #3461:
URL: https://github.com/apache/iceberg-python/pull/3461


   Closes #3170
   
   ## Rationale
   
   Columns that contain large, frequently repeated string values (e.g. JSON 
blobs, low-cardinality categoricals) can exhaust memory when PyArrow loads them 
as plain string arrays. PyArrow's Parquet dataset reader natively supports 
dictionary-encoded reads via its `dictionary_columns` kwarg, which stores each 
distinct value once and can substantially reduce peak memory usage.
   
   This was previously discussed in #3168 and a prior implementation (#3234) 
was closed as stale.
   
   ## Changes
   
   - Added `dictionary_columns: tuple[str, ...] = ()` to `Table.scan()`, 
`TableScan.__init__`, and `StagedTable.scan()`.
   - Forwarded the parameter through `DataScan.to_arrow()` and 
`to_arrow_batch_reader()` → `ArrowScan.__init__` → `_task_to_record_batches` → 
`_get_file_format()`.
   - Applied only when `task.file.file_format == FileFormat.PARQUET`; silently 
ignored for ORC, whose reader does not support this kwarg.
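
The format gate described in the last bullet can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: the `FileFormat` enum here is a stand-in, and `build_reader_kwargs` is a hypothetical helper name.

```python
from enum import Enum


class FileFormat(Enum):
    # Stand-in for the real file format enum, for illustration only.
    PARQUET = "PARQUET"
    ORC = "ORC"


def build_reader_kwargs(
    file_format: FileFormat, dictionary_columns: tuple[str, ...] = ()
) -> dict[str, object]:
    """Return reader kwargs, adding dictionary_columns only for Parquet."""
    kwargs: dict[str, object] = {}
    if file_format == FileFormat.PARQUET and dictionary_columns:
        kwargs["dictionary_columns"] = dictionary_columns
    # For any other format the option is dropped without raising.
    return kwargs


parquet_kwargs = build_reader_kwargs(FileFormat.PARQUET, ("payload",))
orc_kwargs = build_reader_kwargs(FileFormat.ORC, ("payload",))
```

Dropping the option silently keeps scans over tables with mixed file formats working, at the cost of giving the caller no signal that ORC files were read without dictionary encoding.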
   
   ## Usage
   
   ```python
   # Read the "payload" column as dictionary-encoded to save memory
   df = table.scan(dictionary_columns=("payload",)).to_arrow()
   ```
   
   ## Verification
   
   - Added `test_dictionary_columns_produces_dict_encoded_output`, which 
confirms that the requested column is dict-encoded, non-requested columns stay 
plain, and values are identical.
   - `make lint` ✓
   - `pytest tests/table/ tests/io/test_pyarrow.py` ✓


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

