[ https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770195#comment-17770195 ]
ASF GitHub Bot commented on PARQUET-2171:
-----------------------------------------

parthchandra commented on PR #1139:
URL: https://github.com/apache/parquet-mr/pull/1139#issuecomment-1739813955

   @mukund-thakur @steveloughran this is a great PR! Some numbers from an independent benchmark. I used Spark to parallelize the reading of all rowgroups (just the reading of the raw data) from TPC-DS/SF10000/store_sales using various APIs, and here are some numbers for you.

   32 executors, 16 cores
   `fs.s3a.threads.max` = 20

   Reader | Avg Time (minutes) | Median | vs Baseline


> Implement vectored IO in parquet file format
> --------------------------------------------
>
>                 Key: PARQUET-2171
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2171
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Mukund Thakur
>            Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop for improving
> read performance for seek-heavy readers. Spark jobs and others which use
> parquet will greatly benefit from this API. Details can be found here:
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867



--
This message was sent by Atlassian Jira
(v8.20.10#820010)