daverigby commented on issue #39808: URL: https://github.com/apache/arrow/issues/39808#issuecomment-2163183635
I am observing the same thing with a single parquet file of 10M records (https://storage.googleapis.com/pinecone-datasets-dev/yfcc-10M-filter-euclidean-formatted/passages/part-0.parquet - 2.3GB). Using code equivalent to the OP's, I see `total_allocated_bytes` increase consistently over the run, requiring over 5.9GB to iterate the file. Using `ParquetFile.iter_batches` as suggested, memory usage is much more stable (although it increases a little over the duration).

A variation of the OP's reproducer code which includes both modes:

```Python
#!/usr/bin/env python3
import sys

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

path = "part-0.parquet"

if sys.argv[1] == "Dataset":
    print("Using Dataset.to_batches API")
    batches = ds.dataset(path).to_batches(batch_size=100,
                                          batch_readahead=0,
                                          fragment_readahead=0)
else:
    print("Using ParquetFile.iter_batches API")
    batches = pq.ParquetFile(path).iter_batches(batch_size=100)

# Iterate through batches, tracking the high-water mark of allocated bytes.
max_alloc = 0
for batch in batches:
    alloc = pa.total_allocated_bytes()
    if alloc > max_alloc:
        max_alloc = alloc
        print("New max total_allocated_bytes", max_alloc)

del batches
print("Final:", pa.total_allocated_bytes())
```

I see the following numbers (pyarrow 16.1.0, python 3.11.6):

* Dataset.to_batches:
```shell
./pyarrow_39808_repro.py Dataset
Using Dataset.to_batches API
New max total_allocated_bytes 549322688
New max total_allocated_bytes 794176064
New max total_allocated_bytes 1094879616
New max total_allocated_bytes 1340710656
New max total_allocated_bytes 3732962048
New max total_allocated_bytes 4236869504
New max total_allocated_bytes 4849658496
New max total_allocated_bytes 6188106432
Final: 185074688
```

* ParquetFile.iter_batches:
```shell
./pyarrow_39808_repro.py ParquetFile
Using ParquetFile.iter_batches API
New max total_allocated_bytes 252396608
New max total_allocated_bytes 252403456
<cut>
New max total_allocated_bytes 274519744
New max total_allocated_bytes 274521024
Final: 35072
```

i.e. `ParquetFile.iter_batches` requires at most ~261MB, whereas `Dataset.to_batches` requires ~5901MB, 22x more (!), plus an additional 176MB of RAM still in use after iteration completes.
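
As a side note, the peak can also be read from the memory pool itself rather than polled inside the loop. A minimal sketch below (not part of the original reproducer), assuming the reads go through the default memory pool, which is the usual case:

```Python
#!/usr/bin/env python3
# Sketch: report peak allocation via MemoryPool.max_memory() after iterating.
import pyarrow as pa
import pyarrow.dataset as ds

path = "part-0.parquet"  # same local file as in the reproducer above

batches = ds.dataset(path).to_batches(batch_size=100,
                                      batch_readahead=0,
                                      fragment_readahead=0)
for batch in batches:
    pass  # consume the stream

pool = pa.default_memory_pool()
print("Peak bytes allocated:", pool.max_memory())
print("Still allocated after iteration:", pa.total_allocated_bytes())
```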
