sumedhsakdeo commented on code in PR #3046:
URL: https://github.com/apache/iceberg-python/pull/3046#discussion_r2814969439


##########
mkdocs/docs/api.md:
##########
@@ -355,6 +355,36 @@ for buf in tbl.scan().to_arrow_batch_reader():
     print(f"Buffer contains {len(buf)} rows")
 ```
 
+You can control the number of rows per batch using the `batch_size` parameter:
+
+```python
+for buf in tbl.scan().to_arrow_batch_reader(batch_size=1000):
+    print(f"Buffer contains {len(buf)} rows")
+```
+
+By default, each file's batches are materialized in memory before being yielded. For large files that may exceed available memory, use `streaming=True` to yield batches as they are produced without materializing entire files:
+
+```python
+for buf in tbl.scan().to_arrow_batch_reader(streaming=True, batch_size=1000):
+    print(f"Buffer contains {len(buf)} rows")
+```
+
+For maximum throughput, use `concurrent_files` to read multiple files in parallel while streaming. Batches are yielded as they arrive from any file — ordering across files is not guaranteed:
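+
+```python
+# Sketch following the examples above; concurrent_files=4 is an illustrative value.
+for buf in tbl.scan().to_arrow_batch_reader(streaming=True, concurrent_files=4):
+    print(f"Buffer contains {len(buf)} rows")
+```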

Review Comment:
   I think I'd prefer the param to be `order`; `streaming` is confusing. Let me fix that.
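
   Roughly what I have in mind (just a sketch; the name and accepted values aren't final):

   ```python
   # Hypothetical sketch: `order` replacing `streaming`; "any" is a placeholder value.
   for buf in tbl.scan().to_arrow_batch_reader(batch_size=1000, order="any"):
       print(f"Buffer contains {len(buf)} rows")
   ```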



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

