akoumjian opened a new issue, #39808:
URL: https://github.com/apache/arrow/issues/39808

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   If you read a large Parquet file, or a series of Parquet files, through the dataset reader, it accumulates the memory it allocates as you iterate through the batches instead of releasing it.
   
   To recreate:
   
   
   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   # Create a dataset from a large parquet file (or directory of files)
   dataset = ds.dataset('data/parquet', format='parquet')

   # Iterate through the batches; total allocated memory keeps growing
   for batch in dataset.to_batches(batch_size=1000, batch_readahead=0,
                                   fragment_readahead=0):
       print(pa.total_allocated_bytes())
   print(pa.total_allocated_bytes())
   ```
   
   I am running this on macOS, which I believe uses the `mimalloc` backend by default. It's worth noting that `ParquetFile.iter_batches` does not behave this way: if you swap in that iterator, the memory is deallocated as soon as each batch leaves scope.
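   
   For comparison, here is a minimal sketch of the `ParquetFile.iter_batches` path described above. The single-file path is illustrative, and the backend check only confirms which allocator is in use:
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   # Confirm which allocator backend is active (e.g. 'mimalloc', 'jemalloc', 'system')
   print(pa.default_memory_pool().backend_name)

   # Iterate the same data through ParquetFile.iter_batches; unlike
   # dataset.to_batches, each batch's memory is released once it leaves scope.
   pf = pq.ParquetFile('data/parquet/part-0.parquet')  # illustrative single file
   for batch in pf.iter_batches(batch_size=1000):
       print(pa.total_allocated_bytes())
   print(pa.total_allocated_bytes())
   ```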
   
   
   ### Component(s)
   
   Parquet, Python

