[ 
https://issues.apache.org/jira/browse/ORC-262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved ORC-262.
-------------------------------
    Fix Version/s: 2.1.0
       Resolution: Fixed

Issue resolved by pull request 2048
[https://github.com/apache/orc/pull/2048]

> Support async prefetch in Orc reader
> ------------------------------------
>
>                 Key: ORC-262
>                 URL: https://issues.apache.org/jira/browse/ORC-262
>             Project: ORC
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Xiening Dai
>            Assignee: Taiyang Li
>            Priority: Major
>             Fix For: 2.1.0
>
>
> Currently, RowReader::next() reads a batch of rows and returns them to be 
> processed by the runtime. The call is synchronous, meaning that the 
> execution thread is blocked while the reader loads data from disk. We could 
> parallelize execution and data loading through async prefetch, using the 
> logic described below.
> In SeekableFileInputStream::Next(), we first check whether the requested 
> data block has already been prefetched. If so, we simply return the buffer 
> to the caller; otherwise, we issue a sync call to read the data from the 
> file stream. Regardless of how the requested block was loaded, we always 
> issue another async call to prefetch the next block within the current 
> stream.
> Additionally, orc::InputStream will need a new method that performs an 
> async read for a given offset and length.
> According to our experiment, async prefetch can significantly reduce the IO 
> wait time on a heavily loaded distributed file system. By carefully choosing 
> the prefetch data block size, we can maximize the parallelization of runtime 
> execution and data loading, and achieve a relatively high cache hit rate 
> (~85%).
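
The prefetch logic described in the quoted issue can be sketched roughly as
follows. This is a minimal illustration under stated assumptions, not the code
from the pull request: BlockSource, readAsync(), and PrefetchingStream are
hypothetical names standing in for orc::InputStream and
SeekableFileInputStream, an in-memory string stands in for the file, and
std::async stands in for whatever async IO mechanism the reader actually uses.

```cpp
#include <cstdint>
#include <future>
#include <string>
#include <utility>

// Hypothetical stand-in for a file-backed input stream; NOT the ORC API.
class BlockSource {
 public:
  explicit BlockSource(std::string data) : data_(std::move(data)) {}

  uint64_t length() const { return data_.size(); }

  // Synchronous read of up to `len` bytes starting at `offset`.
  std::string read(uint64_t offset, uint64_t len) const {
    if (offset >= data_.size()) return std::string();
    return data_.substr(offset, len);  // substr clamps at end of data
  }

  // Asynchronous read for a given offset and length (assumed shape of the
  // "new method" the issue asks for). The source must outlive the future.
  std::future<std::string> readAsync(uint64_t offset, uint64_t len) const {
    return std::async(std::launch::async,
                      [this, offset, len] { return read(offset, len); });
  }

 private:
  std::string data_;
};

// Hypothetical sequential reader that always keeps one block in flight.
class PrefetchingStream {
 public:
  // `blockSize` is assumed to be greater than zero.
  PrefetchingStream(const BlockSource& src, uint64_t blockSize)
      : src_(src), blockSize_(blockSize) {}

  // Fills `out` with the next block; returns false at end of stream.
  bool next(std::string& out) {
    if (position_ >= src_.length()) return false;
    if (pending_.valid() && pendingOffset_ == position_) {
      out = pending_.get();  // hit: waits only if the read is still running
      ++hits_;
    } else {
      out = src_.read(position_, blockSize_);  // miss: fall back to sync read
      ++misses_;
    }
    position_ += out.size();
    // However the block was loaded, prefetch the one that follows it.
    if (position_ < src_.length()) {
      pendingOffset_ = position_;
      pending_ = src_.readAsync(position_, blockSize_);
    }
    return true;
  }

  int hits() const { return hits_; }
  int misses() const { return misses_; }

 private:
  const BlockSource& src_;
  uint64_t blockSize_;
  uint64_t position_ = 0;
  uint64_t pendingOffset_ = 0;
  std::future<std::string> pending_;
  int hits_ = 0;
  int misses_ = 0;
};
```

In this sketch only the first call to next() on a purely sequential scan
takes the synchronous path; each later call consumes the block requested by
the previous one. A real reader would also have to handle seeks, read errors,
and buffer ownership, which are left out here.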



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
