[PR] perf: iterate over generators when writing datafiles to reduce memory pressure [iceberg-python]

via GitHub Tue, 28 Oct 2025 09:42:45 -0700


hamilton-earthscope opened a new pull request, #2671:
URL: https://github.com/apache/iceberg-python/pull/2671


   # Rationale for this change
   
   Writing to a partitioned table causes a large memory spike while the 
partitions are computed, for two reasons: we call `.combine_chunks()` on each 
new partitioned Arrow table, and we materialize the entire list of partitions 
before writing any data files.
   
   This PR switches the partition computation to a generator to avoid 
materializing all the partitions in memory at once, reducing the memory 
overhead of writing to partitioned tables.
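
As a rough illustration of the idea (not the code in this PR), the sketch below uses plain Python dicts as stand-ins for Arrow rows. The names `iter_partitions` and `write_partitions` are hypothetical and do not come from iceberg-python. It shows the same partitioning logic consumed eagerly as a list and lazily as a generator.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]
Partition = Tuple[object, List[Row]]


def iter_partitions(rows: List[Row], key: str) -> Iterator[Partition]:
    """Yield (partition value, rows) pairs one at a time (hypothetical helper)."""
    # Unique partition values in first-seen order (dicts preserve insertion order).
    values = dict.fromkeys(row[key] for row in rows)
    for value in values:
        # The per-partition copy is built only when the consumer asks for it.
        yield value, [row for row in rows if row[key] == value]


def write_partitions(partitions: Iterable[Partition]) -> List[Tuple[object, int]]:
    """Stand-in writer: records (partition value, row count) per 'data file'."""
    written = []
    for value, part in partitions:
        written.append((value, len(part)))
        # `part` is rebound on the next iteration, so the previous copy can be freed.
    return written


rows = [
    {"region": "us", "n": 1},
    {"region": "eu", "n": 2},
    {"region": "us", "n": 3},
]

# Eager: every partition copy exists at once before any write happens.
eager = list(iter_partitions(rows, "region"))
eager_result = write_partitions(eager)

# Lazy: the writer pulls one partition at a time from the generator.
lazy_result = write_partitions(iter_partitions(rows, "region"))
```

Both paths produce the same output; the difference is only how many partition copies are alive at the same time while the writer runs.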
   
   ## Are these changes tested?
   
   No new tests. The tests using this method were updated to consume the 
generator as a list.
   
   In my own use case, however, I measured `pa.total_allocated_bytes()` 
before and after each write and saw the following across 5 writes:
   
   | Run | Original Impl (Before Write) | Original Impl (After Write) | Iters (Before Write) | Iters (After Write) |
   |---|---|---|---|---|
   | 1 | 29.31 MB | 151.62 MB | 28.38 MB | 30.40 MB |
   | 2 | 27.74 MB | 151.62 MB | 28.85 MB | 30.36 MB |
   | 3 | 28.81 MB | 151.62 MB | 28.52 MB | 31.29 MB |
   | 4 | 28.71 MB | 151.62 MB | 29.27 MB | 30.64 MB |
   | 5 | 28.60 MB | 151.61 MB | 28.29 MB | 31.11 MB |
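
A before/after measurement like the one in the table could be taken with a small helper along these lines. This is a sketch under assumptions: `measure` is a hypothetical name, the MB conversion (1024 * 1024 bytes) is a guess at how the figures above were computed, and the example uses the standard library's `tracemalloc` as a stand-in probe so it runs without pyarrow. To reproduce the PR's setup, one would pass `pa.total_allocated_bytes` as the probe instead.

```python
import tracemalloc
from typing import Callable, Tuple

MB = 1024 * 1024  # assumed unit; the PR does not say how MB was computed


def measure(
    write: Callable[[], None],
    allocated_bytes: Callable[[], int],
) -> Tuple[float, float]:
    """Return (before, after) allocation in MB around a single write call."""
    before = allocated_bytes()
    write()
    after = allocated_bytes()
    return before / MB, after / MB


# Stand-in "write" that leaves roughly 8 MB allocated after it returns.
retained = []


def fake_write() -> None:
    retained.append(bytearray(8 * MB))


# Stand-in probe: currently traced Python allocations, in bytes.
tracemalloc.start()
before_mb, after_mb = measure(fake_write, lambda: tracemalloc.get_traced_memory()[0])
tracemalloc.stop()

print(f"before: {before_mb:.2f} MB, after: {after_mb:.2f} MB")
```

Taking the probe as a parameter keeps the helper independent of the allocator being inspected, so the same function works with Arrow's memory pool counter or any other byte counter.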
   
   
   ## Are there any user-facing changes?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


