[ https://issues.apache.org/jira/browse/ARROW-5502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861119#comment-16861119 ]
Wes McKinney commented on ARROW-5502:
-------------------------------------
By default, the Parquet C++ library reads from disk only the serialized column
data that needs to be deserialized. Using memory-mapping does indeed avoid
allocating memory for the data that is read.
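
To illustrate the point about reading only the needed bytes through a memory
map, here is a minimal sketch using Python's standard `mmap` module. It is not
the parquet-cpp implementation. The file layout (three fixed-size "column
chunks" stored back to back) and the names `read_range_mmap` and `demo` are
assumptions made up for the example.

```python
import mmap
import os
import tempfile


def read_range_mmap(path, offset, length):
    """Return `length` bytes starting at `offset`, read through a memory map."""
    with open(path, "rb") as f:
        # length=0 maps the whole file; pages are loaded lazily on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Slicing copies only the requested range out of the mapping.
            return mm[offset:offset + length]


def demo():
    # Hypothetical layout: three fixed-size "column chunks" back to back.
    chunks = [b"a" * 4096, b"b" * 4096, b"c" * 4096]
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"".join(chunks))
        # Read only the second chunk; the other two are never accessed.
        return read_range_mmap(path, 4096, 4096)
    finally:
        os.remove(path)
```

The slice at the end still copies the selected range into a Python `bytes`
object. A real reader would hand out views over the mapping instead, which is
where the saving in allocation comes from.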
Note that for high latency file sources (like Amazon S3) -- where memory
mapping is not possible -- many data warehousing systems have found it more
efficient to read an entire Parquet row group into memory at a time and discard
the unused columns. We will likely be forced as a matter of performance
optimization to add some reader options to parquet-cpp around this issue.
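
The trade-off described above can be sketched as follows. This is only an
illustration under stated assumptions, not a parquet-cpp API: `CountingSource`
is a hypothetical stand-in for a remote store that counts range requests, and
`layout` is a made-up mapping from column name to `(offset, length)` within one
row group.

```python
class CountingSource:
    """Hypothetical stand-in for a remote object store; counts range requests."""

    def __init__(self, data):
        self._data = data
        self.requests = 0

    def read_range(self, offset, length):
        self.requests += 1  # each call models one high-latency round trip
        return self._data[offset:offset + length]


def read_columns_individually(source, layout, wanted):
    """Issue one range request per wanted column chunk."""
    return {name: source.read_range(*layout[name]) for name in wanted}


def read_row_group_then_discard(source, layout, wanted):
    """Issue one request spanning the row group, then drop unused columns."""
    start = min(off for off, _ in layout.values())
    end = max(off + length for off, length in layout.values())
    buf = source.read_range(start, end - start)
    result = {}
    for name in wanted:
        off, length = layout[name]
        # Offsets in `layout` are absolute; rebase them onto the buffer.
        result[name] = buf[off - start:off - start + length]
    return result
```

Both functions return the same column bytes. The second transfers more data
but makes a single round trip, which is the reason it can win when latency per
request dominates.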
> [R] file readers should mmap
> ----------------------------
>
> Key: ARROW-5502
> URL: https://issues.apache.org/jira/browse/ARROW-5502
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Neal Richardson
> Priority: Major
> Fix For: 0.14.0
>
>
> Arrow is supposed to let you work with datasets bigger than memory. Memory
> mapping is a big part of that. It should be the default way that files are
> read in the `read_*` functions. To disable memory mapping, we could use a
> global `option()`, or a function argument, but that might clutter the
> interface. Or we could not give a choice and only fall back to not memory
> mapping if the platform/file system doesn't support it.