[jira] [Commented] (PARQUET-2366) Optimize random seek during rewriting

ASF GitHub Bot (Jira) Wed, 18 Oct 2023 07:33:05 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776719#comment-17776719
 ]


ASF GitHub Bot commented on PARQUET-2366:
-----------------------------------------

ConeyLiu commented on code in PR #1174:
URL: https://github.com/apache/parquet-mr/pull/1174#discussion_r1363996076


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java:
##########
@@ -92,16 +95,21 @@ public FileEncryptionProperties getFileEncryptionProperties() {
     return fileEncryptionProperties;
   }
 
+  public boolean prefetchBlockAllIndexes() {
+    return prefetchBlockAllIndexes;
+  }
+
   // Builder to create a RewriterOptions.
   public static class Builder {
-    private Configuration conf;
-    private List<Path> inputFiles;
-    private Path outputFile;
+    private final Configuration conf;
+    private final List<Path> inputFiles;
+    private final Path outputFile;
     private List<String> pruneColumns;
     private CompressionCodecName newCodecName;
     private Map<String, MaskMode> maskColumns;
     private List<String> encryptColumns;
     private FileEncryptionProperties fileEncryptionProperties;
+    private boolean prefetchBlockAllIndexes;

Review Comment:
   Disabled by default to preserve the existing behavior.





> Optimize random seek during rewriting
> -------------------------------------
>
>                 Key: PARQUET-2366
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2366
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Xianyang Liu
>            Priority: Major
>
> The `ColumnIndex`, `OffsetIndex`, and `BloomFilter` are stored at the end of 
> the file. We need to perform 4 random seeks when rewriting a column chunk. We 
> found this can heavily impact rewrite performance for files with a large 
> number of columns (~1000). In this PR, we read the `ColumnIndex`, 
> `OffsetIndex`, and `BloomFilter` into a cache to avoid the random seeks. We 
> saw about a 60x performance improvement in production environments for 
> files with about one thousand columns.
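
To illustrate why caching the indexes removes most of the seeks, here is a
minimal, self-contained sketch. It is not the parquet-mr implementation: the
class `IndexPrefetchSketch`, the `CountingFile` helper, the fixed 8-byte
segment size, and the file layout are all hypothetical. It also assumes the
four seeks per column chunk are the page data plus the three index structures,
which the description above does not spell out. The sketch only counts
non-sequential reads on an in-memory "file".

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class IndexPrefetchSketch {
    static final int SEG = 8; // bytes per simulated structure (hypothetical)

    /** In-memory stand-in for a seekable file; counts non-sequential reads. */
    static final class CountingFile {
        final byte[] data;
        long pos = 0;
        int seeks = 0;

        CountingFile(byte[] data) { this.data = data; }

        byte[] readAt(int offset, int length) {
            if (offset != pos) seeks++; // reading away from the current position costs a seek
            pos = offset + length;
            return Arrays.copyOfRange(data, offset, offset + length);
        }
    }

    static final class Result {
        final int seeks;
        final byte[] output;
        Result(int seeks, byte[] output) { this.seeks = seeks; this.output = output; }
    }

    // Assumed layout: [pages x n][column indexes x n][offset indexes x n][bloom filters x n]
    static byte[] buildFile(int n) {
        byte[] f = new byte[4 * n * SEG];
        for (int i = 0; i < f.length; i++) f[i] = (byte) (i * 31);
        return f;
    }

    /** Reads pages and each index structure from the file, column by column. */
    static Result rewriteWithoutCache(int n) {
        CountingFile in = new CountingFile(buildFile(n));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int col = 0; col < n; col++) {
            for (int region = 0; region < 4; region++) {
                byte[] b = in.readAt((region * n + col) * SEG, SEG);
                out.write(b, 0, b.length);
            }
        }
        return new Result(in.seeks, out.toByteArray());
    }

    /** Reads the whole index region once, then serves index lookups from memory. */
    static Result rewriteWithPrefetch(int n) {
        CountingFile in = new CountingFile(buildFile(n));
        int indexStart = n * SEG;
        byte[] cache = in.readAt(indexStart, 3 * n * SEG); // one seek covers all indexes
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int col = 0; col < n; col++) {
            byte[] pages = in.readAt(col * SEG, SEG); // sequential after the first column
            out.write(pages, 0, pages.length);
            for (int region = 1; region < 4; region++) {
                int off = (region * n + col) * SEG - indexStart;
                out.write(cache, off, SEG);
            }
        }
        return new Result(in.seeks, out.toByteArray());
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println("without cache: " + rewriteWithoutCache(n).seeks + " seeks");
        System.out.println("with prefetch: " + rewriteWithPrefetch(n).seeks + " seeks");
    }
}
```

With 1000 simulated columns, the uncached path performs 3999 seeks and the
prefetching path performs 2, while both produce identical output bytes. The
seek counts come from this toy layout only and say nothing about the 60x
figure reported above, which depends on the real storage system.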



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
