[ https://issues.apache.org/jira/browse/PARQUET-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776719#comment-17776719 ]
ASF GitHub Bot commented on PARQUET-2366:
-----------------------------------------

ConeyLiu commented on code in PR #1174:
URL: https://github.com/apache/parquet-mr/pull/1174#discussion_r1363996076


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java:
##########
@@ -92,16 +95,21 @@ public FileEncryptionProperties getFileEncryptionProperties() {
     return fileEncryptionProperties;
   }
 
+  public boolean prefetchBlockAllIndexes() {
+    return prefetchBlockAllIndexes;
+  }
+
   // Builder to create a RewriterOptions.
   public static class Builder {
-    private Configuration conf;
-    private List<Path> inputFiles;
-    private Path outputFile;
+    private final Configuration conf;
+    private final List<Path> inputFiles;
+    private final Path outputFile;
     private List<String> pruneColumns;
     private CompressionCodecName newCodecName;
     private Map<String, MaskMode> maskColumns;
     private List<String> encryptColumns;
     private FileEncryptionProperties fileEncryptionProperties;
+    private boolean prefetchBlockAllIndexes;

Review Comment:
   Disabled by default to keep the existing behavior.


> Optimize random seek during rewriting
> -------------------------------------
>
>                 Key: PARQUET-2366
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2366
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Xianyang Liu
>            Priority: Major
>
> The `ColumnIndex`, `OffsetIndex`, and `BloomFilter` are stored at the end of
> the file, so rewriting a column chunk requires 4 random seeks. We found that
> this heavily impacts rewrite performance for files with a large number of
> columns (~1000). In this PR, we read the `ColumnIndex`, `OffsetIndex`, and
> `BloomFilter` into a cache to avoid the random seeks. In production
> environments we saw about a 60x performance improvement for files with
> about one thousand columns.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
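
For readers following the thread, the caching idea in the issue description above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code from the PR: `IndexPrefetchSketch`, `CountingReader`, and the file layout (one unit-sized data chunk per column, followed by contiguous `ColumnIndex`, `OffsetIndex`, and `BloomFilter` sections) are all hypothetical simplifications. It only counts how often a reader has to jump to a non-contiguous offset.

```java
import java.util.HashMap;
import java.util.Map;

public class IndexPrefetchSketch {

    /** Hypothetical stand-in for a seekable stream: counts a seek for every non-contiguous read. */
    static final class CountingReader {
        private long position = 0;
        private int seeks = 0;

        long read(long offset) {
            if (offset != position) {
                seeks++;           // the reader had to jump to reach this offset
            }
            position = offset + 1; // every record is one unit long in this toy layout
            return offset;         // the "payload" is simply the offset itself
        }

        int seeks() {
            return seeks;
        }
    }

    /** Per column chunk: read the data, then jump to each of the three index sections. */
    static int seeksWithoutPrefetch(int columns) {
        CountingReader in = new CountingReader();
        for (int i = 0; i < columns; i++) {
            in.read(i);                 // column chunk data
            in.read(columns + i);       // assumed ColumnIndex section
            in.read(2L * columns + i);  // assumed OffsetIndex section
            in.read(3L * columns + i);  // assumed BloomFilter section
        }
        return in.seeks();
    }

    /** Read all index sections once into a cache, then stream the data sequentially. */
    static int seeksWithPrefetch(int columns) {
        CountingReader in = new CountingReader();
        Map<Long, Long> cache = new HashMap<>();
        for (long off = columns; off < 4L * columns; off++) {
            cache.put(off, in.read(off)); // one jump to the index area, then contiguous reads
        }
        for (int i = 0; i < columns; i++) {
            in.read(i); // data is read in order; index lookups below never touch the reader
            for (int section = 1; section <= 3; section++) {
                if (cache.get((long) section * columns + i) == null) {
                    throw new IllegalStateException("index entry missing from cache");
                }
            }
        }
        return in.seeks();
    }

    public static void main(String[] args) {
        int columns = 1000;
        System.out.println("seeks without prefetch: " + seeksWithoutPrefetch(columns));
        System.out.println("seeks with prefetch:    " + seeksWithPrefetch(columns));
    }
}
```

In this toy layout the seek count grows linearly with the number of columns when each index structure is fetched on demand, and stays constant when the indexes are prefetched. The trade-off is that the cache must hold every index entry for the block in memory, which is consistent with the option being disabled by default.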