[ 
https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579770#comment-17579770
 ] 

Timothy Miller commented on PARQUET-2171:
-----------------------------------------

The Parquet reader works in two phases. The first does the raw I/O and 
decompression. Someone is working on an asynchronous implementation of this, 
which should help a lot. The second phase works on the output of the first, 
providing higher-level data types. My PRs improve the second phase by 
eliminating LittleEndianInputStream, which was super inefficient, and by 
making some other improvements in the most critical paths. All of these 
improvements are incremental, of course, and we're happy to get contributions 
that improve on this further.
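
To illustrate why a byte-at-a-time little-endian stream can be costly in the second phase, here is a minimal sketch. It is not the code from the PRs. The class and method names are hypothetical, and it assumes the decompressed page is already available as a ByteBuffer. The first method decodes an int with four separate read() calls, the way a stream wrapper typically does. The second sets the byte order on the buffer once and lets getInt() do the decoding.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not part of parquet-mr.
public class LittleEndianSketch {

    // Stream-style decoding: four read() calls per int, each returning
    // one byte (or -1 at end of stream), assembled by hand.
    public static int readIntFromStream(InputStream in) throws IOException {
        int b0 = in.read();
        int b1 = in.read();
        int b2 = in.read();
        int b3 = in.read();
        if ((b0 | b1 | b2 | b3) < 0) {
            throw new EOFException();
        }
        return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    }

    // Buffer-style decoding: set the byte order once, then read whole ints.
    // duplicate() leaves the caller's position and byte order untouched.
    public static int[] readIntsFromBuffer(ByteBuffer page, int count) {
        ByteBuffer le = page.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = le.getInt();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Two little-endian ints: 1 and -1.
        byte[] page = {0x01, 0x00, 0x00, 0x00,
                       (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF};

        InputStream in = new ByteArrayInputStream(page);
        System.out.println(readIntFromStream(in)); // prints 1
        System.out.println(readIntFromStream(in)); // prints -1

        int[] values = readIntsFromBuffer(ByteBuffer.wrap(page), 2);
        System.out.println(values[0] + " " + values[1]); // prints 1 -1
    }
}
```

Both paths produce the same values. The difference is only in how many calls and checks happen per decoded value, which matters in the hottest loops of the reader.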

> Implement vectored IO in parquet file format
> --------------------------------------------
>
>                 Key: PARQUET-2171
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2171
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Mukund Thakur
>            Priority: Major
>
> We recently added a new feature called vectored IO to Hadoop to improve 
> read performance for seek-heavy readers. Spark jobs and other applications 
> that use Parquet will benefit greatly from this API. Details can be found here:
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
