ravwojdyla opened a new issue, #35393:
URL: https://github.com/apache/arrow/issues/35393

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   We have code that fetches a Parquet schema from a file using pyarrow; here's a minimal example:
   
   ```py
   import pyarrow.parquet as pq
   
   with open("/tmp/part.snappy.parquet", mode="rb") as fd:
       s = pq.read_schema(fd)
   ```
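   
   For reference, the spike also shows up with a simple RSS probe (a minimal sketch using `psutil`, which is an assumption here and not necessarily the tool behind the screenshot below):
   
   ```py
   import psutil  # assumed available; used only to sample resident memory
   import pyarrow.parquet as pq
   
   proc = psutil.Process()
   print(f"RSS before: {proc.memory_info().rss / 2**20:.0f} MiB")
   with open("/tmp/part.snappy.parquet", mode="rb") as fd:
       s = pq.read_schema(fd)
   print(f"RSS after:  {proc.memory_info().rss / 2**20:.0f} MiB")
   ```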
   
   That example file is about 288MB; we've noticed that the resident memory usage of this code spikes to nearly 500MB:
   
   <img width="1124" alt="image" src="https://user-images.githubusercontent.com/1419010/235752389-504c0e3c-93ef-4a54-8bfc-62aed6d85417.png">
   
   Is it expected that fetching the schema requires allocating this much memory? It's worth noting that the memory is eventually freed. Should some arguments be tweaked, or is this a bug somewhere?
   
   
   ```sh
   > du -sh /tmp/part.snappy.parquet
   288M    /tmp/part.snappy.parquet
   ```
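   
   A variant worth comparing might be passing the file path with `memory_map=True` instead of an open file object (a sketch; we haven't confirmed whether it avoids the spike):
   
   ```py
   import pyarrow.parquet as pq
   
   # Passing the path lets Arrow open the file itself; memory_map=True
   # asks it to memory-map the file rather than buffer it.
   s = pq.read_schema("/tmp/part.snappy.parquet", memory_map=True)
   ```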
   
   Versions (py 3.10):
   ```
   > conda list | grep arrow
   arrow-cpp                 12.0.0           hce30654_0_cpu    conda-forge
   libarrow                  12.0.0           h3b4cbd9_0_cpu    conda-forge
   pyarrow                   12.0.0          py310h7c67832_0_cpu    conda-forge
   ```
   
   
   
   ### Component(s)
   
   Python

