[ https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776272#comment-17776272 ]
ASF GitHub Bot commented on PARQUET-2171:
-----------------------------------------

parthchandra commented on PR #1139:
URL: https://github.com/apache/parquet-mr/pull/1139#issuecomment-1766778525

   @ahmarsuhail No, these numbers are not with Iceberg and S3FileIO. I used a modified version of the ParquetFileReader (with lots of stuff removed) and a custom benchmark program that reads all the row groups in parallel and records the time spent in each read from S3. The modified version of ParquetFileReader can switch between the various methods of reading from S3. The entry `AWS SDK V2` is a near copy of the Iceberg S3FileIO code, though.
   When running at scale, I saw issues with the CRT client that caused JVM crashes. And the V2 transfer manager did not do range reads properly. Do share your experience.

> Implement vectored IO in parquet file format
> --------------------------------------------
>
>                 Key: PARQUET-2171
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2171
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Mukund Thakur
>            Priority: Major
>
> We recently added a new feature called vectored IO in Hadoop for improving
> read performance for seek-heavy readers. Spark jobs and others that use
> Parquet will greatly benefit from this API. Details can be found here:
> [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5]
> https://issues.apache.org/jira/browse/HADOOP-18103
> https://issues.apache.org/jira/browse/HADOOP-11867

--
This message was sent by Atlassian Jira
(v8.20.10#820010)