neerajd12 opened a new issue, #36754: URL: https://github.com/apache/arrow/issues/36754
### Describe the bug, including details regarding any error messages, version, and platform.

`dataset.head` loads all the data into memory and doesn't release it, when it should load only the top n rows. This issue started after July 17, 2023.

## Versions

- PyArrow: 12.0.0
- Python: 3.10.6
- JupyterLab: 3.3.4
- Docker: 4.12.0 (85629) on Windows 10, version 21H2, build 19044.3086

## Sample data

https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2009-01.parquet (and the same files for all the other months)

## Sample code

1. Install memory_profiler
   ```
   pip3 install memory_profiler
   ```
2. Load the extension and check memory
   ```
   %load_ext memory_profiler
   %memit
   ```
   peak memory: 163.00 MiB, increment: 0.21 MiB
3. Create the dataset
   ```
   import pyarrow.dataset as ds
   data = ds.dataset('./testdata/nyc/year=2009', format='parquet', partitioning='hive')
   ```
4. Check memory
   ```
   %memit
   ```
   peak memory: 157.97 MiB, increment: 0.01 MiB
5. Count rows
   ```
   data.count_rows()
   ```
   170896055
6. Check memory
   ```
   %memit
   ```
   peak memory: 170.34 MiB, increment: 0.02 MiB
7. Get the first 10 rows
   ```
   data.head(10).to_pandas()
   ```
8. Check memory (repeated several times afterwards)
   ```
   %memit
   ```
   peak memory: 11753.76 MiB, increment: 142.51 MiB
   peak memory: 9914.21 MiB, increment: 0.00 MiB
   peak memory: 9914.21 MiB, increment: 0.00 MiB
   peak memory: 9914.21 MiB, increment: 0.00 MiB

### Component(s)

Parquet, Python
