ShreyeshArangath commented on issue #3036: URL: https://github.com/apache/iceberg-python/issues/3036#issuecomment-3888032155
I'd like to work on this issue, but I first wanted to check with the community whether the way it is currently structured is intentional. If the existing assumption is that we should be able to read everything into memory (for testing-like use cases), I propose refactoring the internal ArrowScan logic to support true incremental yielding. Here is how I'm thinking about it:

1. Modify `_task_to_record_batches`: update the signature to accept `batch_size` and pass it into the `scanner_kwargs`.
2. Remove eager constructors: replace `list()` calls with `yield from` statements to maintain the iterator chain from the PyArrow fragment scanner all the way to the user.
3. Enhance the `DataScan` API: introduce a streaming-first method (e.g., `to_record_batches()`) that provides users a direct path to lazy evaluation without the overhead of the current `to_arrow_batches()` logic.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
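The eager-vs-lazy distinction in step 2 can be sketched in plain Python. This is an illustrative stand-in, not the real PyIceberg internals: `_scan_task`, `scan_eager`, and `scan_lazy` are hypothetical names, and plain lists stand in for Arrow record batches. The point is that `list()` materializes every batch of every task before the caller sees the first one, while `yield from` preserves the iterator chain end to end.

```python
from typing import Iterable, Iterator, List

# Hypothetical stand-in for one scan task: yields fixed-size "batches"
# (lists of rows), the way a PyArrow fragment scanner would yield
# record batches for a given batch_size.
def _scan_task(rows: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    buf: List[int] = []
    for row in rows:
        buf.append(row)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if buf:
        yield buf

# Eager shape (current behavior being described): list() forces every
# batch of every task into memory up front.
def scan_eager(tasks: Iterable[Iterable[int]], batch_size: int) -> List[List[int]]:
    return [batch for task in tasks for batch in list(_scan_task(task, batch_size))]

# Incremental shape (proposed): yield from keeps the chain lazy, so a
# batch is produced only when the consumer asks for it.
def scan_lazy(tasks: Iterable[Iterable[int]], batch_size: int) -> Iterator[List[int]]:
    for task in tasks:
        yield from _scan_task(task, batch_size)
```

With the lazy version, `next(scan_lazy([range(10**9)], 3))` returns the first batch after consuming only three rows, whereas the eager version would materialize the entire input first.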
