wingkitlee0 commented on issue #39808: URL: https://github.com/apache/arrow/issues/39808#issuecomment-2954260310
Sharing a plot that I made a little while ago, using pyarrow 20.x and the following options in `to_batches` (or `scanner`):

```python
batch_readahead=0,
cache_metadata=False,  # new
fragment_scan_options=pyarrow.dataset.ParquetFragmentScanOptions(
    use_buffered_stream=True,
    pre_buffer=False,
    cache_options=pa.CacheOptions(lazy=True, prefetch_limit=0),
),
```

I used the 2.3 GB parquet file from earlier in the thread, which has about 10 row groups. In the figure, the blue lines use the default options; the orange and green lines use `cache_metadata=False` etc. Labels like `b5000` give the batch size. The top panel shows the memory used by the current batch; the middle and bottom panels show `pa.total_allocated_bytes()` and RSS (from psutil), respectively.

There are 9-10 spikes, which seem to occur at the beginning of each row group. With the extra options, memory usage still rises, though much more slowly.
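For anyone who wants to reproduce the measurement, here is a minimal sketch of how these options fit together and how the three plotted quantities can be recorded per batch. The file path and batch size are placeholders; the scan options match those above:

```python
import psutil
import pyarrow as pa
import pyarrow.dataset as ds

dataset = ds.dataset("data.parquet", format="parquet")  # placeholder path
proc = psutil.Process()

batches = dataset.to_batches(
    batch_size=5000,       # e.g. "b5000" in the figure labels
    batch_readahead=0,
    cache_metadata=False,  # new option
    fragment_scan_options=ds.ParquetFragmentScanOptions(
        use_buffered_stream=True,
        pre_buffer=False,
        cache_options=pa.CacheOptions(lazy=True, prefetch_limit=0),
    ),
)

for batch in batches:
    # Current-batch size, allocator total, and process RSS:
    # the three panels in the figure, top to bottom.
    print(batch.nbytes, pa.total_allocated_bytes(), proc.memory_info().rss)
```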