[
https://issues.apache.org/jira/browse/PARQUET-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776461#comment-17776461
]
ASF GitHub Bot commented on PARQUET-2366:
-----------------------------------------
wgtmac commented on code in PR #1174:
URL: https://github.com/apache/parquet-mr/pull/1174#discussion_r1363041453
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -265,6 +265,10 @@ private void processBlocksFromReader() throws IOException {
BlockMetaData blockMetaData = meta.getBlocks().get(blockId);
List<ColumnChunkMetaData> columnsInOrder = blockMetaData.getColumns();
+ List<ColumnIndex> columnIndexes = readAllColumnIndexes(reader,
columnsInOrder, descriptorsMap);
Review Comment:
Thanks for adding this! The change looks reasonable to me. I would suggest
adding a new class dedicated to reading and caching these indexes. The new
class would have methods like `readBloomFilter()`, `readColumnIndex()` and
`readOffsetIndex()` for a specific column path, and could be configured to
cache the required columns in advance. With this new class, we can do more
optimizations, such as evicting consumed items from the cache and using async
I/O prefetch to load items. We can split these into separate patches. For the
first one, we may simply add the new class without any caching (i.e. no
behavior change). WDYT?
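
To make the suggestion concrete, here is a rough sketch of the shape such a
class could take. Everything in it is hypothetical: the name
`IndexCacheSketch`, the `prefetch()` method, and the loader functions are
placeholders, and the type parameters `C`, `O` and `B` stand in for
`ColumnIndex`, `OffsetIndex` and `BloomFilter` so the example compiles without
parquet-mr on the classpath. It is not the code from this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not parquet-mr code. C, O and B stand in for
// ColumnIndex, OffsetIndex and BloomFilter; each loader stands in for the
// reader call that seeks to the end of the file for one column path.
final class IndexCacheSketch<C, O, B> {
  private final Function<String, C> columnIndexLoader;
  private final Function<String, O> offsetIndexLoader;
  private final Function<String, B> bloomFilterLoader;

  private final Map<String, C> columnIndexes = new HashMap<>();
  private final Map<String, O> offsetIndexes = new HashMap<>();
  private final Map<String, B> bloomFilters = new HashMap<>();

  IndexCacheSketch(
      Function<String, C> columnIndexLoader,
      Function<String, O> offsetIndexLoader,
      Function<String, B> bloomFilterLoader) {
    this.columnIndexLoader = columnIndexLoader;
    this.offsetIndexLoader = offsetIndexLoader;
    this.bloomFilterLoader = bloomFilterLoader;
  }

  // Loads all three structures for the given column paths in advance.
  // If this is never called, every read below goes straight to the loader,
  // which is the "no caching, no behavior change" first step.
  void prefetch(Iterable<String> columnPaths) {
    for (String path : columnPaths) {
      columnIndexes.put(path, columnIndexLoader.apply(path));
      offsetIndexes.put(path, offsetIndexLoader.apply(path));
      bloomFilters.put(path, bloomFilterLoader.apply(path));
    }
  }

  C readColumnIndex(String path) {
    return take(columnIndexes, path, columnIndexLoader);
  }

  O readOffsetIndex(String path) {
    return take(offsetIndexes, path, offsetIndexLoader);
  }

  B readBloomFilter(String path) {
    return take(bloomFilters, path, bloomFilterLoader);
  }

  // Returns the cached entry and evicts it once consumed; falls back to the
  // loader when the path was not prefetched. containsKey() is used so that a
  // cached null (a column without that index) does not trigger a reload.
  private static <V> V take(Map<String, V> cache, String path, Function<String, V> loader) {
    if (cache.containsKey(path)) {
      return cache.remove(path);
    }
    return loader.apply(path);
  }
}
```

In the rewriter, `prefetch()` would be called once per row group with the
columns to be rewritten, and each column chunk would then pull its entries
through the `read*()` methods. Async prefetch could later replace the eager
loop without changing callers.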
> Optimize random seek during rewriting
> -------------------------------------
>
> Key: PARQUET-2366
> URL: https://issues.apache.org/jira/browse/PARQUET-2366
> Project: Parquet
> Issue Type: Improvement
> Reporter: Xianyang Liu
> Priority: Major
>
> The `ColumnIndex`, `OffsetIndex`, and `BloomFilter` are stored at the end of
> the file. We need to randomly seek 4 times when rewriting a column chunk. We
> found this could heavily impact rewrite performance for files with a large
> number of columns (~1000). In this PR, we read the `ColumnIndex`,
> `OffsetIndex`, and `BloomFilter` into a cache to avoid the random seeks. We
> got about a 60x performance improvement in production environments for
> files with about one thousand columns.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)