mathewcohle commented on issue #2407:
URL: https://github.com/apache/iceberg-python/issues/2407#issuecomment-3607773753

   I think I'm running into the very same issue with a variation of the following code:
   
   ```python
   import pyarrow as pa
   from memory_profiler import profile
   from pyiceberg.expressions import And, GreaterThanOrEqual, LessThan
   from pyiceberg.table import Table


   @profile
   def write_batches(dest_table: Table, batches: list[pa.RecordBatch]) -> None:
       combined = pa.Table.from_batches(batches)
       print(f"  combined.nbytes: {combined.nbytes / 1024 / 1024:.1f} MiB")
       dest_table.append(combined)  # <-- ~3 GB spike here for ~1 MB of data


   # source_catalog, dest_table, partition_start, partition_end, and N
   # are defined elsewhere in the real code.

   # Read data from the source table
   source_table = source_catalog.load_table(...)
   scan = source_table.scan(
       row_filter=And(
           GreaterThanOrEqual("sent_at", partition_start),
           LessThan("sent_at", partition_end),
       )
   )

   # Accumulate batches, flushing every N rows
   batches = []
   for batch in scan.to_arrow_batch_reader():
       batches.append(batch)
       if sum(b.num_rows for b in batches) >= N:
           write_batches(dest_table, batches)
           batches = []
   ```
   
   Running the profiler gives:
   ```
   combined.nbytes: 1.1 MiB
   Line: dest_table.append(combined)  ->  +2946.1 MiB
   ```
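   
   To narrow down where the spike lives, here is a minimal diagnostic sketch I could also run. Assumptions: the same `dest_table` and `combined` objects as above, the third-party `psutil` package, and `append_with_diagnostics` is just a hypothetical helper name, not pyiceberg API. It brackets the append with the process RSS and Arrow's memory-pool peak, so it should show whether the allocation happens inside Arrow's allocator or elsewhere in the process:
   
   ```python
   import psutil
   import pyarrow as pa
   from pyiceberg.table import Table


   def append_with_diagnostics(dest_table: Table, combined: pa.Table) -> None:
       # Hypothetical helper: measure process RSS before/after the append,
       # plus the peak allocation seen by Arrow's default memory pool.
       pool = pa.default_memory_pool()
       rss_before = psutil.Process().memory_info().rss
       dest_table.append(combined)
       rss_after = psutil.Process().memory_info().rss
       print(f"process RSS delta: {(rss_after - rss_before) / 1024 / 1024:.1f} MiB")
       print(f"arrow pool peak:   {pool.max_memory() / 1024 / 1024:.1f} MiB")
   ```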
   
   In case it would help, I can provide a working example with data.
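   
   One more data point I could collect, in case it is useful: a sliced-append variant of the helper above (`CHUNK_ROWS` and `write_batches_sliced` are hypothetical names for this experiment, not pyiceberg API), to see whether the spike scales with the number of rows passed to a single `append` call:
   
   ```python
   import pyarrow as pa
   from pyiceberg.table import Table

   CHUNK_ROWS = 10_000  # arbitrary slice size, just for the experiment


   def write_batches_sliced(dest_table: Table, batches: list[pa.RecordBatch]) -> None:
       # Same input as write_batches above, but append in fixed-size slices
       # so each append call sees a bounded number of rows.
       combined = pa.Table.from_batches(batches)
       for offset in range(0, combined.num_rows, CHUNK_ROWS):
           dest_table.append(combined.slice(offset, CHUNK_ROWS))
   ```
   
   Note this is only a diagnostic, not a proposed fix: each `append` call creates a separate Iceberg snapshot, so slicing like this would not be a sensible production pattern.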

