Tommo56700 commented on issue #2407: URL: https://github.com/apache/iceberg-python/issues/2407#issuecomment-3836489988
`to_arrow_batch_reader()` is quite misleading in how it is documented:

> For large results, using a RecordBatchReader requires less memory than loading an Arrow Table for the same DataScan, because a RecordBatch is read one at a time.

https://github.com/apache/iceberg-python/blob/main/pyiceberg/io/pyarrow.py#L1759-L1803

Internally, `to_record_batches()` is only invoked when the first record batch is iterated, and at that point I see memory spike to 40GB. Looking at the implementation, `def batches_for_task(task: FileScanTask) -> list[pa.RecordBatch]:` eagerly materialises a full list of record batches for each task, and these tasks run in parallel across the PyIceberg worker pool (which defaults to the number of cores). In practice the worker count directly determines how many data files are open and materialised at once, which is a terrible design. After setting `export PYICEBERG_MAX_WORKERS=2`, my peak memory dropped dramatically while the total time to process all record batches was unchanged.

This is a very poor and misleading design for an interface that is supposed to help you reduce memory usage.
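
For reference, here is a minimal sketch of the workaround I ended up with. The catalog name and table identifier are placeholders; the env var can equally be exported in the shell before launching the process, as I did above.

```python
# Cap the PyIceberg worker pool before any scan work starts, since each worker
# opens a separate data file and eagerly materialises its record batches.
import os
os.environ["PYICEBERG_MAX_WORKERS"] = "2"

from pyiceberg.catalog import load_catalog

# "default" and "db.large_table" are placeholders for illustration.
catalog = load_catalog("default")
table = catalog.load_table("db.large_table")

# Stream the scan as record batches instead of loading a full Arrow Table.
reader = table.scan().to_arrow_batch_reader()

total_rows = 0
for batch in reader:  # each item is a pyarrow.RecordBatch
    total_rows += batch.num_rows

print(f"processed {total_rows} rows")
```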
