ShreyeshArangath opened a new pull request, #3037: URL: https://github.com/apache/iceberg-python/pull/3037
WIP, not ready for review. Closes #3036

# Rationale for this change

`ArrowScan` in PyIceberg does not support true streaming, leading to OOM failures when processing large files (file size > container memory). While the API returns an iterator, the implementation eagerly materializes all record batches for a `FileScanTask` before yielding the first row.

Two primary bottlenecks were identified in the `pyiceberg.io.pyarrow` implementation:

1. The internal scan logic calls `list()` on the batch iterator, forcing the entire file into memory.
2. The `batch_size` parameter is not forwarded to the underlying PyArrow `ds.Scanner`, preventing granular memory control. Though, it does fall back to the standard

This behavior makes it impossible to process files larger than the available memory in distributed environments (e.g., Ray workers).

## Are these changes tested?

Yes, tested.

## Are there any user-facing changes?

Yes, a new API on `ArrowScan`: `to_record_batch_stream`.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
