[
https://issues.apache.org/jira/browse/PARQUET-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776719#comment-17776719
]
ASF GitHub Bot commented on PARQUET-2366:
-----------------------------------------
ConeyLiu commented on code in PR #1174:
URL: https://github.com/apache/parquet-mr/pull/1174#discussion_r1363996076
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java:
##########
@@ -92,16 +95,21 @@ public FileEncryptionProperties getFileEncryptionProperties() {
     return fileEncryptionProperties;
   }
+  public boolean prefetchBlockAllIndexes() {
+    return prefetchBlockAllIndexes;
+  }
+
   // Builder to create a RewriterOptions.
   public static class Builder {
-    private Configuration conf;
-    private List<Path> inputFiles;
-    private Path outputFile;
+    private final Configuration conf;
+    private final List<Path> inputFiles;
+    private final Path outputFile;
     private List<String> pruneColumns;
     private CompressionCodecName newCodecName;
     private Map<String, MaskMode> maskColumns;
     private List<String> encryptColumns;
     private FileEncryptionProperties fileEncryptionProperties;
+    private boolean prefetchBlockAllIndexes;
Review Comment:
Disabled by default to preserve the existing behavior.
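For illustration, a minimal sketch of how an unset `boolean` builder field yields the "disabled by default" behavior. `SketchRewriteOptions` and the builder setter name are hypothetical simplifications, not the actual `RewriteOptions` API from the PR:

```java
// Hypothetical, simplified stand-in for RewriteOptions; not the real parquet-mr class.
final class SketchRewriteOptions {
  private final boolean prefetchBlockAllIndexes;

  private SketchRewriteOptions(boolean prefetchBlockAllIndexes) {
    this.prefetchBlockAllIndexes = prefetchBlockAllIndexes;
  }

  boolean prefetchBlockAllIndexes() {
    return prefetchBlockAllIndexes;
  }

  static final class Builder {
    // Java initializes boolean fields to false, so the option is off unless requested.
    private boolean prefetchBlockAllIndexes;

    // Hypothetical setter name; the real Builder method may differ.
    Builder prefetchBlockAllIndexes(boolean enabled) {
      this.prefetchBlockAllIndexes = enabled;
      return this;
    }

    SketchRewriteOptions build() {
      return new SketchRewriteOptions(prefetchBlockAllIndexes);
    }
  }
}
```

Callers that never touch the setter get `false`, so existing rewrite jobs would behave as before under this assumption.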
> Optimize random seek during rewriting
> -------------------------------------
>
> Key: PARQUET-2366
> URL: https://issues.apache.org/jira/browse/PARQUET-2366
> Project: Parquet
> Issue Type: Improvement
> Reporter: Xianyang Liu
> Priority: Major
>
> The `ColumnIndex`, `OffsetIndex`, and `BloomFilter` are stored at the end of
> the file, so rewriting a column chunk requires 4 random seeks. We
> found this can heavily impact rewrite performance for files with a large
> number of columns (~1000). In this PR, we read the `ColumnIndex`,
> `OffsetIndex`, and `BloomFilter` into a cache to avoid the random seeks. We
> saw about a 60x performance improvement in production environments for
> files with about one thousand columns.
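
The caching idea in the description can be sketched roughly as below. This is a toy model, not the PR's implementation: `ColumnIndexes`, `CountingIndexSource`, and `PrefetchedIndexCache` are hypothetical names, the index contents are placeholder strings, and the prefetch is modeled as a single visit to the index region, which is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical holder for one column's index structures (placeholder contents).
final class ColumnIndexes {
  final String columnIndex;
  final String offsetIndex;
  final String bloomFilter;

  ColumnIndexes(String columnIndex, String offsetIndex, String bloomFilter) {
    this.columnIndex = columnIndex;
    this.offsetIndex = offsetIndex;
    this.bloomFilter = bloomFilter;
  }
}

// Hypothetical source that counts how often the index region of the file is visited.
final class CountingIndexSource {
  int visits = 0;

  // On-demand path: every index structure costs one visit.
  String readOne(String column, String kind) {
    visits++;
    return kind + "(" + column + ")";
  }

  // Prefetch path: modeled as one visit that loads the indexes of all columns.
  Map<String, ColumnIndexes> readAll(List<String> columns) {
    visits++;
    Map<String, ColumnIndexes> all = new HashMap<>();
    for (String column : columns) {
      all.put(column, new ColumnIndexes(
          "ColumnIndex(" + column + ")",
          "OffsetIndex(" + column + ")",
          "BloomFilter(" + column + ")"));
    }
    return all;
  }
}

// Cache filled once per block; later lookups never touch the source again.
final class PrefetchedIndexCache {
  private final Map<String, ColumnIndexes> cache;

  PrefetchedIndexCache(CountingIndexSource source, List<String> columns) {
    this.cache = source.readAll(columns);
  }

  ColumnIndexes get(String column) {
    return cache.get(column);
  }
}
```

In this model the on-demand path visits the index region once per structure per column, while the cached path visits it once per block regardless of the column count. The trade-off is the memory needed to hold all of a block's indexes at once.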