[ https://issues.apache.org/jira/browse/ARROW-5502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861119#comment-16861119 ]
Wes McKinney commented on ARROW-5502:
-------------------------------------

The Parquet C++ library by default only reads the serialized column data from disk that needs to be deserialized. Using memory-mapping indeed avoids memory allocation. Note that for high latency file sources (like Amazon S3) -- where memory mapping is not possible -- many data warehousing systems have found it more efficient to read an entire Parquet row group into memory at a time and discard the unused columns. We will likely be forced as a matter of performance optimization to add some reader options to parquet-cpp around this issue.

> [R] file readers should mmap
> ----------------------------
>
>                 Key: ARROW-5502
>                 URL: https://issues.apache.org/jira/browse/ARROW-5502
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Neal Richardson
>            Priority: Major
>             Fix For: 0.14.0
>
>
> Arrow is supposed to let you work with datasets bigger than memory. Memory
> mapping is a big part of that. It should be the default way that files are
> read in the `read_*` functions. To disable memory mapping, we could use a
> global `option()`, or a function argument, but that might clutter the
> interface. Or we could not give a choice and only fall back to not memory
> mapping if the platform/file system doesn't support it.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
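
The fallback described in the issue (memory-map by default, use an ordinary read only when mapping is unavailable) can be sketched with Python's standard library. This is only an illustration of the idea, not the Arrow R or parquet-cpp implementation: `read_range` and its `use_mmap` argument are hypothetical names, and the byte offsets stand in for the location of one column's serialized data.

```python
import mmap
import os
import tempfile


def read_range(path, offset, length, use_mmap=True):
    """Return (data, method) for `length` bytes starting at `offset`.

    Hypothetical helper for illustration; not an Arrow or parquet-cpp API.
    `method` is "mmap" or "read", showing which path was taken.
    """
    with open(path, "rb") as f:
        if use_mmap:
            try:
                # Length 0 maps the whole file; the OS loads pages lazily,
                # so only the pages covering the requested slice are touched.
                with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                    return mapped[offset:offset + length], "mmap"
            except (OSError, ValueError):
                # Mapping can fail (for example on an empty file or an
                # unsupported file system); fall back to a plain read.
                pass
        f.seek(offset)
        return f.read(length), "read"


if __name__ == "__main__":
    # Build a small file with a recognizable region in the middle.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"HEADER" + b"x" * 1000 + b"COLUMN-B" + b"y" * 1000)
        path = tmp.name
    try:
        print(read_range(path, 1006, 8))
        print(read_range(path, 1006, 8, use_mmap=False))
    finally:
        os.remove(path)
```

Keeping the choice in a single argument, with an automatic fallback, mirrors the two options raised in the issue: an explicit switch for callers who want one, and a silent fallback where the platform cannot map the file.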