sumedhsakdeo opened a new pull request, #3044:
URL: https://github.com/apache/iceberg-python/pull/3044
Partially addresses #3036
## Summary
- Forward the `batch_size` parameter to PyArrow's `ds.Scanner.from_fragment()`
to control the number of rows per RecordBatch
- Propagate it through `_task_to_record_batches` →
`_record_batches_from_scan_tasks_and_deletes` → `ArrowScan.to_record_batches` →
`DataScan.to_arrow_batch_reader` (see the sketch below)
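For context, a minimal sketch of what the forwarding might look like. This is heavily simplified: the real `_task_to_record_batches` takes scan-task, projection, and delete-file arguments that are omitted here.

```python
from typing import Iterator, Optional

import pyarrow as pa
import pyarrow.dataset as ds


def _task_to_record_batches(
    fragment: ds.Fragment,
    schema: pa.Schema,
    batch_size: Optional[int] = None,
) -> Iterator[pa.RecordBatch]:
    # Only pass batch_size through when the caller set one, so that
    # PyArrow's own default applies otherwise.
    kwargs = {} if batch_size is None else {"batch_size": batch_size}
    scanner = ds.Scanner.from_fragment(fragment, schema=schema, **kwargs)
    yield from scanner.to_batches()
```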
## PR Stack
This is PR 1 of 3 for #3036:
1. **PR 1 (this)**: forward `batch_size`
2. **PR 2**: add a `streaming` flag to stop materializing entire files
3. **PR 3**: add `concurrent_files` for bounded concurrent streaming
## Rationale for this change
Forwarding the `batch_size` parameter to the PyArrow Scanner lets users control
the row count per RecordBatch, giving finer-grained memory control when reading large files.
## Are these changes tested?
Yes. Unit tests in `test_pyarrow.py` cover `batch_size=100` and `batch_size=None`.
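The PR's actual tests aren't reproduced here, but the PyArrow behavior they rely on can be checked with a self-contained pytest sketch (names are illustrative, not the real test code):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq


def test_batch_size_caps_rows_per_batch(tmp_path):
    # Write a 1,000-row Parquet file into pytest's tmp_path.
    path = str(tmp_path / "data.parquet")
    pq.write_table(pa.table({"n": list(range(1000))}), path)

    # Scan a single fragment with batch_size=100, mirroring what the
    # forwarded parameter does inside the Iceberg scan path.
    dataset = ds.dataset(path, format="parquet")
    fragment = next(dataset.get_fragments())
    scanner = ds.Scanner.from_fragment(fragment, schema=dataset.schema, batch_size=100)

    batches = list(scanner.to_batches())
    # batch_size is an upper bound on rows per batch; no rows are dropped.
    assert all(b.num_rows <= 100 for b in batches)
    assert sum(b.num_rows for b in batches) == 1000
```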
## Are there any user-facing changes?
Yes. A new `batch_size` parameter on `to_arrow_batch_reader()`.
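A sketch of the intended call pattern, assuming the parameter lands as described (the catalog name, table identifier, and `process` callback are illustrative):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # illustrative catalog name
table = catalog.load_table("db.events")  # illustrative table identifier

# Cap each RecordBatch at 1,000 rows instead of PyArrow's default.
reader = table.scan().to_arrow_batch_reader(batch_size=1_000)
for batch in reader:
    process(batch)  # stand-in for user per-batch processing
```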
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]