psavalle commented on PR #1995: URL: https://github.com/apache/iceberg-python/pull/1995#issuecomment-3289304023
It doesn't look like this would solve the problem: even with a single thread, the new implementation still appears to pre-fetch all of the data into memory, regardless of whether the iterator of record batches is being consumed. If the scan has more data files than there is memory available, it would still run out of memory.

I think the point of returning an `Iterator[pa.RecordBatch]` is that the next batch should only be fetched when the next item is consumed from the iterator. For performance, it might still be useful to allow pre-fetching the next batch in the background, but ideally behind an explicitly configurable parameter.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
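For illustration, here is a minimal sketch of the kind of bounded pre-fetching described above. This is not the PR's implementation; `prefetching_iter` and its `prefetch` parameter are hypothetical names. The idea is that a background thread fills a bounded queue, so at most `prefetch` items are ever buffered and consumption of the iterator drives fetching:

```python
import queue
import threading
from typing import Iterator, TypeVar

T = TypeVar("T")


def prefetching_iter(source: Iterator[T], prefetch: int = 1) -> Iterator[T]:
    """Wrap `source` so up to `prefetch` items are fetched ahead in a
    background thread. The bounded queue blocks the producer once it is
    full, so memory use stays bounded even for very large scans.

    (Hypothetical helper for illustration only.)
    """
    if prefetch < 1:
        # Fully lazy: fetch each item only when the consumer asks for it.
        yield from source
        return

    sentinel = object()
    q: "queue.Queue" = queue.Queue(maxsize=prefetch)
    errors: list = []

    def producer() -> None:
        try:
            for item in source:
                q.put(item)  # blocks while `prefetch` items are buffered
        except BaseException as exc:  # surface producer errors to the consumer
            errors.append(exc)
        finally:
            q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while (item := q.get()) is not sentinel:
        yield item
    if errors:
        raise errors[0]
```

With `prefetch=0` this degenerates to a plain lazy iterator; a scan could expose the same knob so users trade memory for throughput explicitly.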
