[PR] do not materialize entire files in `to_record_batches` [iceberg-python]

via GitHub Fri, 31 Oct 2025 16:22:16 -0700


tom-at-rewbi opened a new pull request, #2676:
URL: https://github.com/apache/iceberg-python/pull/2676


   # Rationale for this change
   
   I expected `to_record_batches` to yield batches without materializing an 
entire Parquet file in memory, but the current implementation appears to 
materialize each file explicitly.
   
   This change makes `to_record_batches` iterate batches lazily.
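
To illustrate the difference between the two behaviours, here is a minimal, self-contained sketch. It is not the PyIceberg implementation: `read_file_batches`, `to_batches_eager`, `to_batches_lazy`, and the list-based `Batch` type are hypothetical stand-ins, and the `log` argument exists only to make visible how many batches have been read at a given point.

```python
from typing import Iterable, Iterator, List, Tuple

Batch = List[int]  # hypothetical stand-in for a record batch
ReadLog = List[Tuple[int, int]]  # records (file_id, batch_index) for each read


def read_file_batches(
    file_id: int, log: ReadLog, batches_per_file: int = 3
) -> Iterator[Batch]:
    """Hypothetical reader: produce the batches of one 'file' one at a time."""
    for i in range(batches_per_file):
        log.append((file_id, i))  # note that this batch has now been read
        yield [file_id * 100 + i]


def to_batches_eager(file_ids: Iterable[int], log: ReadLog) -> Iterator[Batch]:
    """Read every batch of a file into a list before yielding any of them."""
    for file_id in file_ids:
        batches = list(read_file_batches(file_id, log))  # whole file in memory
        yield from batches


def to_batches_lazy(file_ids: Iterable[int], log: ReadLog) -> Iterator[Batch]:
    """Yield each batch as soon as it is read, holding one batch at a time."""
    for file_id in file_ids:
        yield from read_file_batches(file_id, log)


if __name__ == "__main__":
    eager_log: ReadLog = []
    next(to_batches_eager([1, 2], eager_log))
    print("batches read before first yield (eager):", len(eager_log))

    lazy_log: ReadLog = []
    next(to_batches_lazy([1, 2], lazy_log))
    print("batches read before first yield (lazy):", len(lazy_log))
```

Both generators produce the same sequence of batches. The only difference is how much has to be read, and held in memory, before the consumer receives the first one.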
   
   ## Are these changes tested?
   
   I ran `make test`, and all tests passed except the two that import 
kerberos, which I am currently unable to build.
   
   As for verifying that this reduces memory usage: with this change, my data 
pipeline stopped OOM'ing.
   
   ## Are there any user-facing changes?
   
   There should not be any user-facing changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

