[
https://issues.apache.org/jira/browse/ARROW-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178980#comment-17178980
]
Wes McKinney commented on ARROW-9633:
-------------------------------------
I mostly want to be sure that file formats that are sensitive to a file
handle's performance characteristics (for example, Parquet files are highly
sensitive to the latency of reads) are able to understand what they are getting
so that they can choose to set other options to improve performance. For
example:
* Will read buffering (or pre-buffering) improve performance?
* Is it OK to make blocking IO calls or should an IO call allow a CPU core to
be made available to other threads for execution?
* Do Read calls allocate memory?
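As a rough illustration of the three questions above, a file handle could expose a small description of itself that a format reader consults before picking its options. This is only a sketch: `IOCharacteristics`, `ReadPlan`, `plan_reads` and the example values are hypothetical names made up for this comment, not existing Arrow APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IOCharacteristics:
    """Hypothetical description of a file handle (not an Arrow API)."""
    high_latency: bool        # would buffering or pre-buffering help?
    reads_block_thread: bool  # does a read tie up the calling thread?
    reads_allocate: bool      # does each Read call allocate memory?


@dataclass(frozen=True)
class ReadPlan:
    """Options a format reader might set in response (also hypothetical)."""
    pre_buffer: bool
    use_io_threads: bool
    expect_zero_copy: bool


def plan_reads(io_info: IOCharacteristics) -> ReadPlan:
    # One option per question; a real reader would weigh more factors.
    return ReadPlan(
        pre_buffer=io_info.high_latency,
        use_io_threads=io_info.reads_block_thread,
        expect_zero_copy=not io_info.reads_allocate,
    )


# Illustrative values only, chosen to show both ends of the range.
remote_like = IOCharacteristics(
    high_latency=True, reads_block_thread=True, reads_allocate=True)
in_memory_like = IOCharacteristics(
    high_latency=False, reads_block_thread=False, reads_allocate=False)

print(plan_reads(remote_like))
print(plan_reads(in_memory_like))
```

The point is only that a reader such as Parquet could make these choices itself if the handle told it what it is getting.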
I'm all for abstraction/encapsulation where it makes sense but these issues can
result in meaningful changes to the wall clock time of accessing data.
I'm fine with taking no action right now, but if we want Arrow to be the gold
standard for data access and the platform that people choose to build on, we
should be vigilant.
> [C++] Do not toggle memory mapping globally in LocalFileSystem
> --------------------------------------------------------------
>
> Key: ARROW-9633
> URL: https://issues.apache.org/jira/browse/ARROW-9633
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 2.0.0
>
>
> In the context of the Datasets API, some file formats benefit greatly from
> memory mapping (like Arrow IPC files) while others benefit less. Additionally,
> in some scenarios, memory mapping could fail when used on network-attached
> storage devices. A single filesystem may be used to read different kinds of
> files, some with memory mapping and some without, and the Datasets API should
> be able to fall back on non-memory-mapped reads if the attempt to memory map
> fails. It would therefore make sense to have a non-global option for this:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/localfs.h
> I would suggest adding a new filesystem API, something like
> {{OpenMappedInputFile}}, with options to control the behavior when memory
> mapping is not possible. These options may be among:
> * Falling back on a normal RandomAccessFile
> * Reading the entire file into memory (or even tmpfs?) and then wrapping it
> in a BufferReader
> * Failing
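The three fallback behaviors listed above can be sketched with Python's standard library. This is not the proposed Arrow C++ API: `MmapFallback` and `open_mapped_input_file` are hypothetical names, an ordinary file object stands in for a RandomAccessFile, and `io.BytesIO` stands in for a BufferReader.

```python
import enum
import io
import mmap


class MmapFallback(enum.Enum):
    """Hypothetical options for what to do when mapping is not possible."""
    RANDOM_ACCESS_FILE = "random_access_file"  # fall back on a normal file
    READ_INTO_MEMORY = "read_into_memory"      # read everything, wrap in a buffer
    FAIL = "fail"                              # propagate the error


def open_mapped_input_file(path, fallback=MmapFallback.FAIL):
    """Try to memory map `path` read-only; apply `fallback` if that fails."""
    f = open(path, "rb")
    try:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    except (OSError, ValueError):
        # Python raises ValueError for an empty file and OSError for
        # failures reported by the operating system.
        if fallback is MmapFallback.FAIL:
            f.close()
            raise
        if fallback is MmapFallback.READ_INTO_MEMORY:
            with f:
                return io.BytesIO(f.read())
        return f  # RANDOM_ACCESS_FILE: hand back the ordinary file object
    else:
        # The mapping keeps its own duplicate of the file descriptor,
        # so the original handle can be closed.
        f.close()
        return mapped
```

Because the option is passed per call, one filesystem object can serve both mapped and non-mapped reads, which is the non-global behavior the issue asks for.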