GayathriSrividya opened a new pull request, #3461:
URL: https://github.com/apache/iceberg-python/pull/3461

   Closes #3170
   
   ## Rationale
   
   Columns that contain large or frequently repeated string values (e.g. JSON 
blobs, low-cardinality categoricals) can exhaust memory when PyArrow loads them 
as plain string arrays. PyArrow's Parquet reader natively supports 
dictionary-encoded reads via the `dictionary_columns` kwarg of 
`pyarrow.dataset.ParquetFileFormat`, which stores each distinct value once 
alongside integer indices and can substantially reduce peak memory usage for 
columns with many repeated values.
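
To make the deduplication idea concrete, here is a minimal pure-Python sketch of dictionary encoding. It is an illustration only, not PyArrow's implementation; the `dictionary_encode` helper and the sample column are hypothetical.

```python
def dictionary_encode(values):
    """Return (dictionary, indices) such that dictionary[i] recovers each row."""
    dictionary = []   # each distinct value, stored once, in first-seen order
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one small integer per row instead of a full string
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


# Hypothetical column: three distinct JSON payloads repeated across nine rows.
column = ['{"status": "ok"}', '{"status": "error"}', '{"status": "retry"}'] * 3
dictionary, indices = dictionary_encode(column)

# Decoding through the dictionary reproduces the original column exactly.
decoded = [dictionary[i] for i in indices]
```

The saving grows with the ratio of rows to distinct values; a column whose values are all unique gains nothing from this representation.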
   
   This was previously discussed in #3168, and a prior implementation (#3234) 
was closed as stale.
   
   ## Changes
   
   - Added `dictionary_columns: tuple[str, ...] = ()` to `Table.scan()`, 
`TableScan.__init__`, and `StagedTable.scan()`.
   - Forwarded the value through `DataScan.to_arrow()` and 
`to_arrow_batch_reader()` → `ArrowScan.__init__` → `_task_to_record_batches` → 
`_get_file_format()`.
   - Applied the option only when `task.file.file_format == FileFormat.PARQUET`; 
it is silently ignored for ORC, whose reader does not support this kwarg.
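
The format gating described in the last bullet can be sketched roughly as follows. This is not the code from the PR: the `reader_kwargs` helper is hypothetical, and the small `FileFormat` enum is a local stand-in for the library's own type so the snippet runs on its own.

```python
from enum import Enum
from typing import Any


class FileFormat(Enum):
    """Local stand-in for the library's FileFormat (assumption for this sketch)."""
    PARQUET = "PARQUET"
    ORC = "ORC"


def reader_kwargs(file_format: FileFormat,
                  dictionary_columns: tuple[str, ...] = ()) -> dict[str, Any]:
    """Build extra reader kwargs, passing dictionary_columns only for Parquet."""
    kwargs: dict[str, Any] = {}
    if file_format == FileFormat.PARQUET and dictionary_columns:
        # Only the Parquet reader understands this option.
        kwargs["dictionary_columns"] = list(dictionary_columns)
    # For ORC (or an empty tuple) nothing is added, so the option is a no-op.
    return kwargs


parquet_kwargs = reader_kwargs(FileFormat.PARQUET, ("payload",))
orc_kwargs = reader_kwargs(FileFormat.ORC, ("payload",))
```

Keeping the default as an empty tuple means existing callers that never pass `dictionary_columns` see no change in behavior.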
   
   ## Usage
   
   ```python
   # Read the "payload" column as dictionary-encoded to save memory
   df = table.scan(dictionary_columns=("payload",)).to_arrow()
   ```
   
   ## Verification
   
   - Added `test_dictionary_columns_produces_dict_encoded_output` — confirms 
the requested column is dict-encoded, non-requested columns are plain, and 
values are identical.
   - `make lint` ✓
   - `pytest tests/table/ tests/io/test_pyarrow.py` ✓


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

