wingkitlee0 commented on issue #39808: URL: https://github.com/apache/arrow/issues/39808#issuecomment-2825922411
Came across this issue recently and I can still reproduce the behavior described in https://github.com/apache/arrow/issues/39808#issuecomment-2163183635.

Previously I tried `pre_buffer=False` and `use_buffered_stream=True` in [`ParquetFragmentScanOptions`](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.ParquetFragmentScanOptions.html), after which `total_allocated_bytes` stopped growing. There is also a new `cache_metadata` option for `to_batches` (not released yet; dev version only), which seems to reduce memory usage by some percentage. However, the memory-usage gap between `dataset` and `ParquetFile` is still quite large.
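For reference, a minimal sketch of the scan-option setup described above (the dataset path and batch size are placeholders, not from the original report):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Disable pre-buffering and use a buffered stream instead,
# which is the combination that stopped total_allocated_bytes from growing.
scan_options = ds.ParquetFragmentScanOptions(
    pre_buffer=False,
    use_buffered_stream=True,
)
fmt = ds.ParquetFileFormat(default_fragment_scan_options=scan_options)

# "data/" is a placeholder path to a directory of Parquet files.
dataset = ds.dataset("data/", format=fmt)

for batch in dataset.to_batches(batch_size=65536):
    # ... process batch ...
    # Track Arrow's pool allocation while iterating.
    print(pa.total_allocated_bytes())
```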