ShreyeshArangath opened a new pull request, #3037:
URL: https://github.com/apache/iceberg-python/pull/3037

   
   WIP: not ready for review.
   Closes #3036 
   
   # Rationale for this change
`ArrowScan` in PyIceberg does not support true streaming, leading to OOM failures when processing large files (file size > container size). While the API returns an iterator, the implementation eagerly materializes all record batches for a `FileScanTask` before yielding the first row.
   
   Two primary bottlenecks were identified in the `pyiceberg.io.pyarrow` implementation:
   
   1. The internal scan logic calls `list()` on the batch iterator, forcing the entire file into memory.
   2. The `batch_size` parameter is not forwarded to the underlying PyArrow `ds.Scanner`, so scans fall back to PyArrow's default batch size and offer no granular memory control.
   
   This behavior makes it impossible to process files larger than the available memory in distributed environments (e.g., Ray workers). A sketch contrasting the two patterns follows below.
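   
   The following is a minimal sketch, not PyIceberg's actual internals, contrasting the eager `list()` pattern with a lazy generator. The function names and the default batch size are illustrative; only the `pyarrow.dataset` calls are real.
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds
   from typing import Iterator
   
   def eager_batches(dataset: ds.Dataset) -> Iterator[pa.RecordBatch]:
       # Anti-pattern: list() drains the scanner, so every batch of the file
       # is resident in memory before the first one is yielded.
       batches = list(dataset.scanner().to_batches())
       yield from batches
   
   def streaming_batches(dataset: ds.Dataset, batch_size: int = 10_000) -> Iterator[pa.RecordBatch]:
       # Lazy pattern: forward batch_size to the PyArrow scanner and yield
       # each batch as it is produced, keeping roughly one batch in memory
       # at a time.
       for batch in dataset.scanner(batch_size=batch_size).to_batches():
           yield batch
   ```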
   
   
   ## Are these changes tested?
   Yes, tested.
   
   ## Are there any user-facing changes?
   Yes, a new API on `ArrowScan`: `to_record_batch_stream`. A usage sketch follows.
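   
   A hypothetical consumer loop, assuming the new method yields `pyarrow.RecordBatch` objects; the exact signature is defined in this PR and may differ. `arrow_scan` and `handle` are placeholders, not names from the PR.
   
   ```python
   # `arrow_scan` is an already-constructed ArrowScan; `handle` is any
   # user-supplied per-batch function. Both are placeholders for illustration.
   for batch in arrow_scan.to_record_batch_stream():
       # Each RecordBatch can be processed and released before the next is
       # fetched, bounding peak memory to roughly one batch.
       handle(batch)
   ```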
   

