[ 
https://issues.apache.org/jira/browse/PARQUET-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776170#comment-17776170
 ] 

ASF GitHub Bot commented on PARQUET-2366:
-----------------------------------------

ConeyLiu commented on code in PR #1174:
URL: https://github.com/apache/parquet-mr/pull/1174#discussion_r1361986315


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -265,6 +265,10 @@ private void processBlocksFromReader() throws IOException {
       BlockMetaData blockMetaData = meta.getBlocks().get(blockId);
       List<ColumnChunkMetaData> columnsInOrder = blockMetaData.getColumns();
 
+      List<ColumnIndex> columnIndexes = readAllColumnIndexes(reader, columnsInOrder, descriptorsMap);

Review Comment:
   I could add an option for this if anyone is concerned about memory usage. 
This caches the metadata for only one block at a time, so it should use less 
memory than file writing, which needs to cache the metadata of all blocks.





> Optimize random seek during rewriting
> -------------------------------------
>
>                 Key: PARQUET-2366
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2366
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Xianyang Liu
>            Priority: Major
>
> The `ColumnIndex`, `OffsetIndex`, and `BloomFilter` are stored at the end of 
> the file. We need to randomly seek 4 times when rewriting a column chunk. We 
> found this could heavily impact rewrite performance for files with a large 
> number of columns (~1000). In this PR, we read the `ColumnIndex`, 
> `OffsetIndex`, and `BloomFilter` into a cache to avoid the random seeks. We 
> saw roughly a 60x performance improvement in production environments for 
> files with about one thousand columns.
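
The seek pattern described above can be illustrated with a small, self-contained sketch. This is not the code from the PR and does not use the Parquet API: `CountingInput`, `Chunk`, and `layout` are hypothetical stand-ins, and the model is simplified to a single index region per chunk (instead of `ColumnIndex`, `OffsetIndex`, and `BloomFilter` separately), with index entries assumed to be laid out contiguously at the end of the file. It only counts how often a read lands somewhere other than the current position.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockIndexCacheSketch {

    /** Hypothetical stand-in for a seekable input that counts non-sequential reads. */
    public static final class CountingInput {
        private long position = 0;
        private int seeks = 0;

        /** Reads length bytes at offset; counts a seek if offset is not the current position. */
        public void read(long offset, int length) {
            if (offset != position) {
                seeks++;
            }
            position = offset + length;
        }

        public int seeks() {
            return seeks;
        }
    }

    /** Hypothetical column chunk: where its pages and its index entry live in the file. */
    public static final class Chunk {
        public final long dataOffset;
        public final long indexOffset;
        public final int length;

        public Chunk(long dataOffset, long indexOffset, int length) {
            this.dataOffset = dataOffset;
            this.indexOffset = indexOffset;
            this.length = length;
        }
    }

    /** Lays out n chunks: data back to back from offset 0, index entries back to back after all data. */
    public static List<Chunk> layout(int n, int length) {
        List<Chunk> chunks = new ArrayList<>();
        long indexStart = (long) n * length;
        for (int i = 0; i < n; i++) {
            chunks.add(new Chunk((long) i * length, indexStart + (long) i * length, length));
        }
        return chunks;
    }

    /** Per-chunk access: jump to the index at the end of the file, then back to the data. */
    public static int seeksWithoutCache(List<Chunk> chunks) {
        CountingInput in = new CountingInput();
        for (Chunk c : chunks) {
            in.read(c.indexOffset, c.length);
            in.read(c.dataOffset, c.length);
        }
        return in.seeks();
    }

    /** Cached access: read every index entry of the block in one pass, then the data in order. */
    public static int seeksWithCache(List<Chunk> chunks) {
        CountingInput in = new CountingInput();
        for (Chunk c : chunks) {
            in.read(c.indexOffset, c.length); // prefetch into the (implied) cache
        }
        for (Chunk c : chunks) {
            in.read(c.dataOffset, c.length);
        }
        return in.seeks();
    }

    public static void main(String[] args) {
        List<Chunk> chunks = layout(1000, 10);
        System.out.println("seeks without cache: " + seeksWithoutCache(chunks));
        System.out.println("seeks with cache:    " + seeksWithCache(chunks));
    }
}
```

In this toy model with 1000 chunks, the interleaved pattern performs 2000 seeks while the prefetching pattern performs 2. Real files will differ, since index structures are not necessarily contiguous, but the sketch shows why reading the footer-side metadata for a block up front removes the per-chunk jumps.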



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
