mattwarkentin opened a new issue, #14880:
URL: https://github.com/apache/arrow/issues/14880
### Describe the usage question you have. Please include as many useful
details as possible.
Hi,
I am wondering if someone from the Arrow team could offer some guidance on
best practices for handling very large data in an optimal way (such as whether
partitioning is even the answer). The specific data is a TSV file that is 26 GB
on disk and ~50 GB in memory when read into R. The data frame is ~500K rows and
~14K columns. It is prohibitively slow/memory intensive to read the full data
across each of several projects when, typically, only a small subset of the
data (either subset of rows or columns) is relevant for any given project.
However, the filtering conditions change from project to project, so I don't
see an obvious column to use for grouping and partitioning.
Does it ever make sense to randomly chunk/partition the data into smaller sets
of 5000-10000 observations? My understanding was that much of the memory gain
would occur if you chunked on a sensible variable (e.g., `year`) and then when
you `filter()` a certain year, some of the data sets won't even be
touched/loaded. Does random chunking of observations offer any time/memory
advantage?
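For illustration, the variable-based partitioning I have in mind looks roughly like this (a sketch, assuming a hypothetical `year` column; the paths and column name are made up):

```r
library(arrow)
library(dplyr)

# One-time step: rewrite the data partitioned by a `year` column,
# producing one subdirectory per year (year=2019/, year=2020/, ...).
open_dataset("data.tsv", format = "tsv") |>
  write_dataset("partitioned/", format = "parquet", partitioning = "year")

# Later reads that filter on `year` should only scan the matching
# partition directories; the other files are never touched.
df <- open_dataset("partitioned/") |>
  filter(year == 2020) |>
  collect()
```

With a random chunking scheme there is no such column to filter on, which is why I suspect every chunk would still be scanned.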
Most commonly, I need most/all rows but only a very small set of columns. I
had hoped that something like the following would work, where `...`
is just a small set of column names:
```r
# Lazily open the TSV as a dataset, then pull only a few columns
ds <- arrow::open_dataset('data.tsv', format = 'tsv')
df <- ds |> dplyr::select(...) |> dplyr::collect()
```
But this is seemingly just as slow as loading the full table. I had thought
only `...` columns would be read into memory so there would be a time savings.
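My working theory (which I'd love confirmed or corrected) is that because TSV is row-oriented, selecting a column subset still requires scanning every line, and that a one-time conversion to a columnar format like Parquet is what would actually enable column pruning. A sketch of what I mean (hypothetical paths and column names):

```r
library(arrow)
library(dplyr)

# One-time conversion: Parquet stores each column separately on disk,
# so a column subset can be read without touching the other ~14K columns.
open_dataset("data.tsv", format = "tsv") |>
  write_dataset("data_parquet/", format = "parquet")

# Subsequent reads should only materialize the selected columns.
df <- open_dataset("data_parquet/") |>
  select(sample_id, phenotype) |>  # hypothetical column names
  collect()
```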
Anyway, any suggestions? Am I fundamentally misunderstanding how to handle
larger-than-memory data with `arrow`?
### Component(s)
R
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]