sumedhsakdeo opened a new pull request, #3044:
URL: https://github.com/apache/iceberg-python/pull/3044
Partially addresses #3036
## Summary
- Forward the `batch_size` parameter to PyArrow's `ds.Scanner.from_fragment()`
to control the number of rows per RecordBatch
- Propagate it through `_task_to_record_batches` →
`_record_batches_from_scan_tasks_and_deletes` → `ArrowScan.to_record_batches` →
`DataScan.to_arrow_batch_reader` (see the sketch below)
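For context, a minimal sketch of what the forwarding might look like. This is heavily simplified: the real `_task_to_record_batches` takes scan-task, projection, and delete-file arguments that are omitted here.

```python
from typing import Iterator, Optional

import pyarrow as pa
import pyarrow.dataset as ds


def _task_to_record_batches(
    fragment: ds.Fragment,
    schema: pa.Schema,
    batch_size: Optional[int] = None,
) -> Iterator[pa.RecordBatch]:
    # Only pass batch_size through when the caller set one, so that
    # PyArrow's own default applies otherwise.
    kwargs = {} if batch_size is None else {"batch_size": batch_size}
    scanner = ds.Scanner.from_fragment(fragment, schema=schema, **kwargs)
    yield from scanner.to_batches()
```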
## PR Stack
This is PR 1 of 3 for #3036:
1. **PR 1 (this)**: forward `batch_size`
2. **PR 2**: add a `streaming` flag to stop materializing entire files
3. **PR 3**: add `concurrent_files` for bounded concurrent streaming
## Rationale for this change
Forwarding the `batch_size` parameter to the PyArrow Scanner lets users control
the row count per RecordBatch, giving finer-grained memory control when reading large files.
## Are these changes tested?
Yes. Unit tests in `test_pyarrow.py` cover `batch_size=100` and `batch_size=None`.
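The PR's actual tests aren't reproduced here, but the PyArrow behavior they rely on can be checked with a self-contained pytest sketch (names are illustrative, not the real test code):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq


def test_batch_size_caps_rows_per_batch(tmp_path):
    # Write a 1,000-row Parquet file into pytest's tmp_path.
    path = str(tmp_path / "data.parquet")
    pq.write_table(pa.table({"n": list(range(1000))}), path)

    # Scan a single fragment with batch_size=100, mirroring what the
    # forwarded parameter does inside the Iceberg scan path.
    dataset = ds.dataset(path, format="parquet")
    fragment = next(dataset.get_fragments())
    scanner = ds.Scanner.from_fragment(fragment, schema=dataset.schema, batch_size=100)

    batches = list(scanner.to_batches())
    # batch_size is an upper bound on rows per batch; no rows are dropped.
    assert all(b.num_rows <= 100 for b in batches)
    assert sum(b.num_rows for b in batches) == 1000
```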
## Are there any user-facing changes?
Yes. A new `batch_size` parameter on `to_arrow_batch_reader()`.
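A sketch of the intended call pattern, assuming the parameter lands as described (the catalog name, table identifier, and `process` callback are illustrative):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # illustrative catalog name
table = catalog.load_table("db.events")  # illustrative table identifier

# Cap each RecordBatch at 1,000 rows instead of PyArrow's default.
reader = table.scan().to_arrow_batch_reader(batch_size=1_000)
for batch in reader:
    process(batch)  # stand-in for user per-batch processing
```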
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]