ShreyeshArangath commented on issue #3036: URL: https://github.com/apache/iceberg-python/issues/3036#issuecomment-3888032155
I'd like to work on this issue, but I first wanted to check with the community whether the way it is currently structured is intentional. If the existing assumption is that we should be able to read everything into memory (for testing-like use cases), I propose refactoring the internal ArrowScan logic to support true incremental yielding. Here is how I'm thinking about it:

1. Modify `_task_to_record_batches`: update the signature to accept `batch_size` and pass it into the `scanner_kwargs`.
2. Remove eager constructors: replace `list()` calls with `yield from` statements to maintain the iterator chain from the PyArrow fragment scanner all the way to the user.
3. Enhance the `DataScan` API: introduce a streaming-first method (e.g., `to_record_batches()`) that provides users a direct path to lazy evaluation without the overhead of the current `to_arrow_batches()` logic.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
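The eager-vs-lazy distinction in step 2 can be sketched in plain Python. This is an illustrative stand-in, not the real PyIceberg internals: `_scan_task`, `scan_eager`, and `scan_lazy` are hypothetical names, and plain lists stand in for Arrow record batches. The point is that `list()` materializes every batch of every task before the caller sees the first one, while `yield from` preserves the iterator chain end to end.

```python
from typing import Iterable, Iterator, List

# Hypothetical stand-in for one scan task: yields fixed-size "batches"
# (lists of rows), the way a PyArrow fragment scanner would yield
# record batches for a given batch_size.
def _scan_task(rows: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    buf: List[int] = []
    for row in rows:
        buf.append(row)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if buf:
        yield buf

# Eager shape (current behavior being described): list() forces every
# batch of every task into memory up front.
def scan_eager(tasks: Iterable[Iterable[int]], batch_size: int) -> List[List[int]]:
    return [batch for task in tasks for batch in list(_scan_task(task, batch_size))]

# Incremental shape (proposed): yield from keeps the chain lazy, so a
# batch is produced only when the consumer asks for it.
def scan_lazy(tasks: Iterable[Iterable[int]], batch_size: int) -> Iterator[List[int]]:
    for task in tasks:
        yield from _scan_task(task, batch_size)
```

With the lazy version, `next(scan_lazy([range(10**9)], 3))` returns the first batch after consuming only three rows, whereas the eager version would materialize the entire input first.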
