neerajd12 opened a new issue, #36754: URL: https://github.com/apache/arrow/issues/36754
### Describe the bug, including details regarding any error messages, version, and platform.

`dataset.head` loads all the data into memory and doesn't release it, when it should load only the top n rows. This issue started after July 17, 2023.

## Versions

- PyArrow: 12.0.0
- Python: 3.10.6
- JupyterLab: 3.3.4
- Docker: 4.12.0 (85629) on Windows 10, version 21H2, build 19044.3086

## Sample data

https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2009-01.parquet (and the same files for all the other months)

## Sample code

1. Install memory_profiler
   ```
   pip3 install memory_profiler
   ```
2. Load the extension and check memory
   ```
   %load_ext memory_profiler
   %memit
   ```
   peak memory: 163.00 MiB, increment: 0.21 MiB
3. Create the dataset
   ```
   import pyarrow.dataset as ds
   data = ds.dataset('./testdata/nyc/year=2009', format='parquet', partitioning='hive')
   ```
4. Check memory
   ```
   %memit
   ```
   peak memory: 157.97 MiB, increment: 0.01 MiB
5. Count rows
   ```
   data.count_rows()
   ```
   170896055
6. Check memory
   ```
   %memit
   ```
   peak memory: 170.34 MiB, increment: 0.02 MiB
7. Get the first 10 rows
   ```
   data.head(10).to_pandas()
   ```
8. Check memory (repeated several times afterwards)
   ```
   %memit
   ```
   peak memory: 11753.76 MiB, increment: 142.51 MiB
   peak memory: 9914.21 MiB, increment: 0.00 MiB
   peak memory: 9914.21 MiB, increment: 0.00 MiB
   peak memory: 9914.21 MiB, increment: 0.00 MiB

### Component(s)

Parquet, Python
