[ https://issues.apache.org/jira/browse/ORC-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun resolved ORC-262.
-------------------------------
Fix Version/s: 2.1.0
Resolution: Fixed
Issue resolved by pull request 2048
[https://github.com/apache/orc/pull/2048]
> Support async prefetch in Orc reader
> ------------------------------------
>
> Key: ORC-262
> URL: https://issues.apache.org/jira/browse/ORC-262
> Project: ORC
> Issue Type: Improvement
> Components: C++
> Reporter: Xiening Dai
> Assignee: Taiyang Li
> Priority: Major
> Fix For: 2.1.0
>
>
> Currently the RowReader::next() method reads a batch of rows and returns them
> to be processed by the runtime. The call is synchronous, meaning that the
> execution thread is blocked while the reader is loading data from disk. We
> could potentially parallelize execution and data loading through async
> prefetch, using the logic described below.
> In SeekableFileInputStream::Next(), we first check whether the requested data
> block has already been prefetched. If so, we simply return the buffer to the
> caller; otherwise we issue a sync call to read the data from the file stream.
> No matter how we load the requested data block, we always issue another async
> call to prefetch the next block within the current stream.
> Additionally, orc::InputStream will need a new method that performs an async
> read for a given offset and length.
> According to our experiment, async prefetch can significantly reduce the IO
> wait time on a heavily loaded distributed file system. By carefully choosing
> the prefetch data block size, we can maximize the parallelization of runtime
> execution and data loading, and achieve a relatively high cache hit rate
> (~85%).
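
The prefetch logic described in the issue can be illustrated with a minimal
sketch. Everything named here (SketchInputStream, readAsync, PrefetchingReader)
is hypothetical and exists only for illustration; it is not the actual ORC
interface, nor necessarily what the linked pull request implements. The sketch
assumes strictly sequential reads over an in-memory stream and uses std::async
as a stand-in for a real asynchronous file read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for orc::InputStream (NOT the real ORC interface).
class SketchInputStream {
 public:
  explicit SketchInputStream(std::string data) : data_(std::move(data)) {}

  uint64_t getLength() const { return data_.size(); }

  // Synchronous read of `length` bytes starting at `offset`.
  void read(char* buf, uint64_t length, uint64_t offset) const {
    std::copy_n(data_.data() + offset, length, buf);
  }

  // Assumed shape of the proposed "async read for a given offset and length".
  std::future<std::vector<char>> readAsync(uint64_t length,
                                           uint64_t offset) const {
    return std::async(std::launch::async, [this, length, offset] {
      std::vector<char> buf(length);
      read(buf.data(), length, offset);
      return buf;
    });
  }

 private:
  std::string data_;
};

// Hypothetical reader mirroring the Next() logic from the description.
// The stream must outlive the reader.
class PrefetchingReader {
 public:
  PrefetchingReader(const SketchInputStream& in, uint64_t blockSize)
      : in_(in), blockSize_(blockSize) {
    assert(blockSize > 0);
  }

  // Returns the next block, or an empty vector at end of stream.
  std::vector<char> next() {
    const uint64_t total = in_.getLength();
    if (pos_ >= total) return {};
    const uint64_t len = std::min(blockSize_, total - pos_);

    std::vector<char> block;
    if (pending_.valid() && pendingOffset_ == pos_) {
      // Block was prefetched: take it (waits only if still in flight).
      block = pending_.get();
      ++hits_;
    } else {
      // Not prefetched: fall back to a synchronous read.
      block.resize(len);
      in_.read(block.data(), len, pos_);
    }
    pos_ += len;

    // Regardless of how this block was loaded, prefetch the next one.
    if (pos_ < total) {
      pendingOffset_ = pos_;
      pending_ = in_.readAsync(std::min(blockSize_, total - pos_), pos_);
    }
    return block;
  }

  uint64_t hits() const { return hits_; }

 private:
  const SketchInputStream& in_;
  uint64_t blockSize_;
  uint64_t pos_ = 0;
  uint64_t pendingOffset_ = 0;
  uint64_t hits_ = 0;
  std::future<std::vector<char>> pending_;
};
```

In this sketch the first call always misses and every later sequential call is
served from the prefetched buffer. A real reader would also have to handle
seeks, read errors, and buffer reuse. On some toolchains std::async requires
linking with -pthread.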
--
This message was sent by Atlassian Jira
(v8.20.10#820010)