mattwarkentin opened a new issue, #14880:
URL: https://github.com/apache/arrow/issues/14880
### Describe the usage question you have. Please include as many useful
details as possible.
Hi,
I am wondering if someone from the Arrow team could offer some guidance on
best practices for handling very large data in an optimal way (such as whether
partitioning is even the answer). The specific data is a TSV file that is 26 GB
on disk and ~50 GB in memory when read into R. The data frame is ~500K rows and
~14K columns. It is prohibitively slow/memory intensive to read the full data
across each of several projects when, typically, only a small subset of the
data (either subset of rows or columns) is relevant for any given project.
However, the filtering conditions change from project to project, so I don't
see an obvious column to use for grouping and partitioning.
Does it ever make sense to randomly chunk/partition the data into smaller sets
of 5000-10000 observations? My understanding was that much of the memory gain
would occur if you chunked on a sensible variable (e.g., `year`) and then when
you `filter()` a certain year, some of the data sets won't even be
touched/loaded. Does random chunking of observations offer any time/memory
advantage?
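For illustration, the variable-based partitioning I have in mind looks roughly like this (a sketch, assuming a hypothetical `year` column; the paths and column name are made up):

```r
library(arrow)
library(dplyr)

# One-time step: rewrite the data partitioned by a `year` column,
# producing one subdirectory per year (year=2019/, year=2020/, ...).
open_dataset("data.tsv", format = "tsv") |>
  write_dataset("partitioned/", format = "parquet", partitioning = "year")

# Later reads that filter on `year` should only scan the matching
# partition directories; the other files are never touched.
df <- open_dataset("partitioned/") |>
  filter(year == 2020) |>
  collect()
```

With a random chunking scheme there is no such column to filter on, which is why I suspect every chunk would still be scanned.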
Most commonly, I need most/all rows but only a very small set of columns. I
had hoped that something like the following would work, where `...`
is just a small set of column names:
```r
# Lazily open the TSV as a dataset, then pull only a few columns
ds <- arrow::open_dataset('data.tsv', format = 'tsv')
df <- ds |> dplyr::select(...) |> dplyr::collect()
```
But this is seemingly just as slow as loading the full table. I had thought
only `...` columns would be read into memory so there would be a time savings.
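My working theory (which I'd love confirmed or corrected) is that because TSV is row-oriented, selecting a column subset still requires scanning every line, and that a one-time conversion to a columnar format like Parquet is what would actually enable column pruning. A sketch of what I mean (hypothetical paths and column names):

```r
library(arrow)
library(dplyr)

# One-time conversion: Parquet stores each column separately on disk,
# so a column subset can be read without touching the other ~14K columns.
open_dataset("data.tsv", format = "tsv") |>
  write_dataset("data_parquet/", format = "parquet")

# Subsequent reads should only materialize the selected columns.
df <- open_dataset("data_parquet/") |>
  select(sample_id, phenotype) |>  # hypothetical column names
  collect()
```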
Anyway, any suggestions? Am I fundamentally misunderstanding how to handle
larger-than-memory data with `arrow`?
### Component(s)
R
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]