mathewcohle commented on issue #2407:
URL: https://github.com/apache/iceberg-python/issues/2407#issuecomment-3607773753
I think I'm running into the very same issue with a variation of the following code:
```python
import pyarrow as pa
from memory_profiler import profile

from pyiceberg.expressions import And, GreaterThanOrEqual, LessThan
from pyiceberg.table import Table


@profile
def write_batches(dest_table: Table, batches: list[pa.RecordBatch]) -> None:
    combined = pa.Table.from_batches(batches)
    print(f"combined.nbytes: {combined.nbytes / 1024 / 1024:.1f} MiB")
    dest_table.append(combined)  # <-- 3 GB spike here for ~1 MB of data


# Read data from the source table
source_table = source_catalog.load_table(...)
scan = source_table.scan(
    row_filter=And(
        GreaterThanOrEqual("sent_at", partition_start),
        LessThan("sent_at", partition_end),
    )
)

# Accumulate batches and flush once N rows are buffered
batches = []
for batch in scan.to_arrow_batch_reader():
    batches.append(batch)
    if sum(b.num_rows for b in batches) >= N:
        write_batches(dest_table, batches)
        batches = []
if batches:
    write_batches(dest_table, batches)  # flush the remaining tail
```
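Note that `memory_profiler` reports process RSS, so it can't tell whether the spike lives on the Python heap or in Arrow's default memory pool. A minimal sketch of how one might narrow that down, assuming the same `dest_table` and `combined` as in the snippet above, is to sample `pa.total_allocated_bytes()` around the append:

```python
import pyarrow as pa

# Sample Arrow's default memory pool before and after the append.
# A large delta here would point at allocations inside Arrow itself
# rather than on the Python heap.
before = pa.total_allocated_bytes()
dest_table.append(combined)  # dest_table / combined as in the snippet above
after = pa.total_allocated_bytes()
print(f"Arrow pool delta: {(after - before) / 1024 / 1024:.1f} MiB")
```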
Running the profiler gives:
```
# Output:
# combined.nbytes: 1.1 MiB
# Line: dest_table.append(combined) -> +2946.1 MiB
```
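For a control that skips pyiceberg entirely, one could profile a plain Parquet write of the same batches (a sketch; `write_parquet_only` and the `/tmp/control.parquet` scratch path are just illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile


@profile
def write_parquet_only(batches: list[pa.RecordBatch]) -> None:
    # Same combination step as write_batches, but a plain Parquet write
    # instead of Table.append; peak memory should stay near combined.nbytes.
    combined = pa.Table.from_batches(batches)
    pq.write_table(combined, "/tmp/control.parquet")
```

If this stays flat while `write_batches` spikes, that would localize the problem to the append path rather than to combining the batches.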
In case it would help, I can provide a working example with data.