akoumjian opened a new issue, #39808:
URL: https://github.com/apache/arrow/issues/39808
### Describe the bug, including details regarding any error messages, version, and platform.
When reading a large Parquet file (or a series of Parquet files) through the dataset reader, the memory it allocates keeps accumulating as you iterate through the batches instead of being released.
To recreate:
```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
# Create dataset from large parquet file or files
dataset = ds.dataset('data/parquet', format='parquet')
# Iterate through batches
for batch in dataset.to_batches(batch_size=1000, batch_readahead=0, fragment_readahead=0):
    # Allocated memory keeps growing with each batch instead of being released
    print(pa.total_allocated_bytes())
print(pa.total_allocated_bytes())
```
I am running this on OSX, which I believe uses the `mimalloc` backend by default. It's worth noting that `ParquetFile.iter_batches` does not behave this way: if you swap in that iterator, the memory is deallocated as soon as each batch goes out of scope.
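For comparison, here is a minimal sketch of the `ParquetFile.iter_batches` path; the file path is hypothetical and stands in for one of the files in the dataset above. The first line also shows how to confirm which memory pool backend is actually in use:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Check which allocator backend the default memory pool uses (e.g. "mimalloc")
print(pa.default_memory_pool().backend_name)

# Hypothetical path to one of the files in the dataset
pf = pq.ParquetFile('data/parquet/part-0.parquet')

# With this iterator, allocated bytes stay roughly flat: each batch's memory
# is released once the batch goes out of scope
for batch in pf.iter_batches(batch_size=1000):
    print(pa.total_allocated_bytes())
print(pa.total_allocated_bytes())
```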
### Component(s)
Parquet, Python