[GitHub] [arrow] lnicola opened a new issue #7338: [Python] DataSet uses too much memory when filtering

GitBox Tue, 02 Jun 2020 22:23:45 -0700


lnicola opened a new issue #7338:
URL: https://github.com/apache/arrow/issues/7338



   I'm running this query over a 14 GB Arrow IPC file:
   
   ```python
   >>> import pyarrow.dataset as dataset
   >>> ds = dataset.dataset("foo.ipc", format="ipc")
   >>> t = ds.to_table(filter=dataset.field('ID') <= 1000).to_pandas()
   >>> t
   [snip]
   [914 rows x 617 columns]
   ```
   
   If I'm reading the documentation correctly, it should scan the file and
collect only the matching rows, without loading the whole file into memory.
However, the RSS grows to about 14 GB while the query runs, then drops back down.
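
To put a number on the RSS observation, here is a minimal sketch that reports how much the process's peak RSS grows while a callable runs. The helper names `peak_rss_mib` and `measure_peak_growth` are hypothetical, not part of pyarrow, and the sketch assumes a Unix-like system where the standard-library `resource` module is available.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure_peak_growth(fn):
    """Run fn() and return (result, growth of peak RSS in MiB).

    Hypothetical helper: the peak is a high-water mark, so the growth
    is only meaningful if fn() pushes RSS above any earlier peak.
    """
    before = peak_rss_mib()
    result = fn()
    return result, peak_rss_mib() - before
```

For the query above, it could be called in a fresh interpreter as `measure_peak_growth(lambda: ds.to_table(filter=dataset.field('ID') <= 1000))`. Since the value is a high-water mark, it captures the temporary growth even though the RSS goes back down afterwards.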


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
