[ https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17579770#comment-17579770 ]
Timothy Miller commented on PARQUET-2171:
-----------------------------------------

The parquet reader has two phases of reading. One does the raw I/O and decompression. Someone is working on an asynchronous implementation of this, which should help a lot. The second phase works on the output of that, providing higher-level data types. My PRs improve on this by eliminating LittleEndianInputStream, which was super inefficient, plus some other improvements in the most critical paths. All of these improvements are incremental, of course, and we're happy to get contributions that improve on this further.

> Implement vectored IO in parquet file format
> --------------------------------------------
>
>                 Key: PARQUET-2171
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2171
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Mukund Thakur
>            Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop for improving
> read performance for seek-heavy readers. Spark jobs and others that use
> parquet will greatly benefit from this API. Details can be found here:
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
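
For readers unfamiliar with why a byte-at-a-time little-endian stream can become a hot spot in the second (decoding) phase described in the comment, here is a minimal sketch. It is not code from parquet-mr or from the PRs mentioned above. The class and method names (LittleEndianSketch, readIntFromStream, readIntFromBuffer) are hypothetical, and using a little-endian ByteBuffer as the replacement is an assumption made only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only; not taken from parquet-mr.
class LittleEndianSketch {

    // Stream style: four separate read() calls, reassembled by hand.
    static int readIntFromStream(InputStream in) throws IOException {
        int b0 = in.read();
        int b1 = in.read();
        int b2 = in.read();
        int b3 = in.read();
        // read() returns -1 at end of stream, which makes the OR negative.
        if ((b0 | b1 | b2 | b3) < 0) {
            throw new EOFException("stream ended inside a 4-byte value");
        }
        return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
    }

    // Buffer style: a single accessor call on a little-endian buffer.
    static int readIntFromBuffer(ByteBuffer buf) {
        return buf.order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) throws IOException {
        byte[] page = {0x78, 0x56, 0x34, 0x12}; // 0x12345678, little-endian
        int fromStream = readIntFromStream(new ByteArrayInputStream(page));
        int fromBuffer = readIntFromBuffer(ByteBuffer.wrap(page));
        System.out.println(Integer.toHexString(fromStream) + " "
                + Integer.toHexString(fromBuffer)); // prints "12345678 12345678"
    }
}
```

Both methods decode the same value. The buffer version avoids one method call per byte, which is the kind of per-value overhead the comment appears to be describing. Whether the actual PRs take this exact approach is not stated in the thread.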