[ https://issues.apache.org/jira/browse/PARQUET-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776567#comment-17776567 ]
ASF GitHub Bot commented on PARQUET-2366:
-----------------------------------------

wgtmac commented on code in PR #1174:
URL: https://github.com/apache/parquet-mr/pull/1174#discussion_r1363512338


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########

@@ -265,6 +265,10 @@ private void processBlocksFromReader() throws IOException {
       BlockMetaData blockMetaData = meta.getBlocks().get(blockId);
       List<ColumnChunkMetaData> columnsInOrder = blockMetaData.getColumns();
+      List<ColumnIndex> columnIndexes = readAllColumnIndexes(reader, columnsInOrder, descriptorsMap);

Review Comment:
   > Do you mean to read the indexes column by column to reduce memory footprint?
   
   No, my suggested interface does not restrict any implementation detail; the indexes just need to be ready by the time `readXXX()` is called. You can still read all indexes at once (controlled by a config). We can also configure it to release each consumed index object to reduce the memory footprint.


> Optimize random seek during rewriting
> -------------------------------------
>
>                 Key: PARQUET-2366
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2366
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Xianyang Liu
>            Priority: Major
>
> The `ColumnIndex`, `OffsetIndex`, and `BloomFilter` are stored at the end of
> the file, so rewriting a single column chunk requires four random seeks. We
> found this can hurt rewrite performance heavily for files with a large
> number of columns (~1000). In this PR, we read the `ColumnIndex`,
> `OffsetIndex`, and `BloomFilter` into a cache to avoid the random seeks.
> This gave us roughly a 60x performance improvement in our production
> environment for files with about one thousand columns.
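To make the interface under discussion concrete, below is a minimal sketch of a per-row-group index cache, assuming the behavior wgtmac describes: indexes may all be read in one pass (controlled by a config flag), they must be ready by the `readXXX()` call, and each consumed entry is released to bound the memory footprint. The class name `BlockIndexCache`, the `eager` flag, and the release-on-read policy are illustrative assumptions, not the interface adopted in PR #1174; only the `ParquetFileReader` read methods are existing parquet-mr APIs.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ColumnPath;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

// Hypothetical cache for one row group's indexes; a BloomFilter map would
// be handled the same way as the two index maps below.
public class BlockIndexCache {
  private final ParquetFileReader reader;
  private final boolean eager; // config switch: bulk prefetch vs. read on demand
  private final Map<ColumnPath, ColumnIndex> columnIndexes = new HashMap<>();
  private final Map<ColumnPath, OffsetIndex> offsetIndexes = new HashMap<>();

  public BlockIndexCache(ParquetFileReader reader, boolean eager) {
    this.reader = reader;
    this.eager = eager;
  }

  // Read every index of the row group up front so that rewriting each
  // column chunk no longer needs to seek back to the file footer region.
  public void prefetch(List<ColumnChunkMetaData> columnsInOrder) throws IOException {
    if (!eager) {
      return; // lazy mode: the getters below fall back to per-column reads
    }
    for (ColumnChunkMetaData chunk : columnsInOrder) {
      columnIndexes.put(chunk.getPath(), reader.readColumnIndex(chunk));
      offsetIndexes.put(chunk.getPath(), reader.readOffsetIndex(chunk));
    }
  }

  // The index is ready by the time this is called; the cached entry is
  // removed on first read so its memory can be reclaimed immediately.
  public ColumnIndex readColumnIndex(ColumnChunkMetaData chunk) throws IOException {
    ColumnIndex cached = columnIndexes.remove(chunk.getPath());
    return cached != null ? cached : reader.readColumnIndex(chunk);
  }

  public OffsetIndex readOffsetIndex(ColumnChunkMetaData chunk) throws IOException {
    OffsetIndex cached = offsetIndexes.remove(chunk.getPath());
    return cached != null ? cached : reader.readOffsetIndex(chunk);
  }
}
```

Releasing each entry on first read means an index lives only until its column chunk has been rewritten, which is how a bulk prefetch can keep the peak footprint close to the column-by-column alternative raised in the quoted question.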