Re: [PR] feat: Add streaming and concurrent file reads to ArrowScan to reduce memory usage and increase throughput [iceberg-python]

via GitHub Mon, 16 Feb 2026 19:37:07 -0800


sumedhsakdeo commented on code in PR #3046:
URL: https://github.com/apache/iceberg-python/pull/3046#discussion_r2814826867



##########
mkdocs/docs/api.md:
##########
@@ -355,6 +355,36 @@ for buf in tbl.scan().to_arrow_batch_reader():
     print(f"Buffer contains {len(buf)} rows")
 ```
 
+You can control the number of rows per batch using the `batch_size` parameter:
+
+```python
+for buf in tbl.scan().to_arrow_batch_reader(batch_size=1000):
+    print(f"Buffer contains {len(buf)} rows")
+```
+
+By default, each file's batches are materialized in memory before being 
yielded. For large files that may exceed available memory, use `streaming=True` 
to yield batches as they are produced without materializing entire files:
+
+```python
+for buf in tbl.scan().to_arrow_batch_reader(streaming=True, batch_size=1000):
+    print(f"Buffer contains {len(buf)} rows")
+```
+
+For maximum throughput, use `concurrent_files` to read multiple files in 
parallel while streaming. Batches are yielded as they arrive from any file — 
ordering across files is not guaranteed:

Review Comment:
   I am not sure we can guarantee deterministic input order to DDP. For 
example, our tables go through continuous optimization, and files are rewritten. 
So two training runs on the same logically unchanged table, one before and one 
after compaction, will yield different file scan tasks.
   
   Regardless, the API supports the two modes you are referring to:
     - `streaming=False` (default): deterministic task order, high memory; 
batches are grouped by file in submission order
     - `streaming=True`: arrival order; batches are interleaved, with no 
ordering guarantee
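
To make the difference between the two modes concrete, here is a toy model in plain Python. It is only a sketch and not the code in this PR: `FILES`, `read_file`, `materialized_order`, and `arrival_order` are hypothetical names, the "files" are in-memory lists, and the thread-per-file design with a bounded queue is an assumption chosen for illustration.

```python
import queue
import threading
from typing import Dict, Iterator, List

# Hypothetical stand-in for data files: file name -> batches in file order.
FILES: Dict[str, List[str]] = {
    "file-a": ["a0", "a1", "a2"],
    "file-b": ["b0", "b1"],
    "file-c": ["c0", "c1", "c2"],
}


def read_file(name: str) -> Iterator[str]:
    """Yield the batches of one (simulated) file in file order."""
    yield from FILES[name]


def materialized_order(names: List[str]) -> Iterator[str]:
    """Read each file fully before yielding: grouped by file, submission order."""
    for name in names:
        batches = list(read_file(name))  # the whole file is held in memory
        yield from batches


def arrival_order(names: List[str], max_buffered: int = 2) -> Iterator[str]:
    """Read files on worker threads and yield batches as they arrive."""
    out: queue.Queue = queue.Queue(maxsize=max_buffered)  # bounds buffered batches
    done = object()  # sentinel pushed once per finished file

    def worker(name: str) -> None:
        try:
            for batch in read_file(name):
                out.put(batch)  # blocks while the consumer is behind
        finally:
            out.put(done)

    threads = [
        threading.Thread(target=worker, args=(n,), daemon=True) for n in names
    ]
    for t in threads:
        t.start()

    remaining = len(threads)
    while remaining:
        item = out.get()
        if item is done:
            remaining -= 1
        else:
            yield item


if __name__ == "__main__":
    names = list(FILES)
    print("materialized:", list(materialized_order(names)))
    print("arrival:     ", list(arrival_order(names)))
```

In this sketch the first generator always yields the same sequence for the same list of file names, while the second yields the same set of batches in an order that depends on thread scheduling. Neither helps if the list of file scan tasks itself changes between runs, which is the compaction case described above.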



##########
mkdocs/docs/api.md:
##########
@@ -355,6 +355,36 @@ for buf in tbl.scan().to_arrow_batch_reader():
     print(f"Buffer contains {len(buf)} rows")
 ```
 
+You can control the number of rows per batch using the `batch_size` parameter:
+
+```python

Review Comment:
   Done. Added clarifying documentation.



##########
tests/benchmark/test_read_benchmark.py:
##########
@@ -0,0 +1,168 @@
+# Licensed to the Apache Software Foundation (ASF) under one

Review Comment:
   ✅ 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

