daverigby commented on issue #39808: URL: https://github.com/apache/arrow/issues/39808#issuecomment-2163183635
I am observing the same thing with a single parquet file of 10M records (https://storage.googleapis.com/pinecone-datasets-dev/yfcc-10M-filter-euclidean-formatted/passages/part-0.parquet - 2.3GB). Using code equivalent to the OP's, I see `total_allocated_bytes` increase consistently over the run, requiring over 5.9GB to iterate the file. Using `ParquetFile.iter_batches` as suggested, memory usage is much more stable (although it increases a little over the duration).

A variation of the OP's reproducer code which includes both modes:

```Python
#!/usr/bin/env python3
import sys

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

path = "part-0.parquet"

if sys.argv[1] == "Dataset":
    print("Using Dataset.to_batches API")
    batches = ds.dataset(path).to_batches(batch_size=100,
                                          batch_readahead=0,
                                          fragment_readahead=0)
else:
    print("Using ParquetFile.iter_batches API")
    batches = pq.ParquetFile(path).iter_batches(batch_size=100)

# Iterate through batches, tracking the high-water mark of allocated bytes.
max_alloc = 0
for batch in batches:
    alloc = pa.total_allocated_bytes()
    if alloc > max_alloc:
        max_alloc = alloc
        print("New max total_allocated_bytes", max_alloc)

del batches
print("Final:", pa.total_allocated_bytes())
```

I see the following numbers (pyarrow 16.1.0, python 3.11.6):

* Dataset.to_batches:
```shell
./pyarrow_39808_repro.py Dataset
Using Dataset.to_batches API
New max total_allocated_bytes 549322688
New max total_allocated_bytes 794176064
New max total_allocated_bytes 1094879616
New max total_allocated_bytes 1340710656
New max total_allocated_bytes 3732962048
New max total_allocated_bytes 4236869504
New max total_allocated_bytes 4849658496
New max total_allocated_bytes 6188106432
Final: 185074688
```

* ParquetFile.iter_batches:
```shell
./pyarrow_39808_repro.py ParquetFile
Using ParquetFile.iter_batches API
New max total_allocated_bytes 252396608
New max total_allocated_bytes 252403456
<cut>
New max total_allocated_bytes 274519744
New max total_allocated_bytes 274521024
Final: 35072
```

i.e. `ParquetFile.iter_batches` requires at most ~261MB, whereas `Dataset.to_batches` requires ~5901MB, 22x more (!), plus an additional 176MB of RAM still in use after iteration completes.
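
As a side note, the peak can also be read from the memory pool itself rather than polled inside the loop. A minimal sketch below (not part of the original reproducer), assuming the reads go through the default memory pool, which is the usual case:

```Python
#!/usr/bin/env python3
# Sketch: report peak allocation via MemoryPool.max_memory() after iterating.
import pyarrow as pa
import pyarrow.dataset as ds

path = "part-0.parquet"  # same local file as in the reproducer above

batches = ds.dataset(path).to_batches(batch_size=100,
                                      batch_readahead=0,
                                      fragment_readahead=0)
for batch in batches:
    pass  # consume the stream

pool = pa.default_memory_pool()
print("Peak bytes allocated:", pool.max_memory())
print("Still allocated after iteration:", pa.total_allocated_bytes())
```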
