ConeyLiu commented on code in PR #1174:
URL: https://github.com/apache/parquet-mr/pull/1174#discussion_r1363403671
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
```
@@ -265,6 +265,10 @@ private void processBlocksFromReader() throws IOException {
       BlockMetaData blockMetaData = meta.getBlocks().get(blockId);
       List<ColumnChunkMetaData> columnsInOrder = blockMetaData.getColumns();
+      List<ColumnIndex> columnIndexes = readAllColumnIndexes(reader, columnsInOrder, descriptorsMap);
```

Review Comment:
@wgtmac thanks for your suggestions. Do you mean reading the indexes column by column to reduce the memory footprint? That approach would indeed use less memory. As I understand it, the indexes are laid out in the file as follows:
```
// column indexes
block1_col1_column_index
...
block1_coln_column_index
block2_col1_column_index
...
block2_coln_column_index
...
// offset indexes
block1_col1_offset_index
...
block1_coln_offset_index
block2_col1_offset_index
...
block2_coln_offset_index
...
// bloom filters
block1_col1_bloom_index
...
block1_coln_bloom_index
block2_col1_bloom_index
...
block2_coln_bloom_index
...
```
The problem is that reading column by column still requires one random seek per index per column, i.e. 3 * (number of columns) seeks for a single row group; async I/O would help hide that random-seek cost. With this PR, each of the three index sections for a row group is read contiguously, so we only need 3 random seeks per row group (column pruning aside).
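For reference, here is a minimal sketch of the column-by-column path being discussed, built on the per-chunk accessors that already exist on `org.apache.parquet.hadoop.ParquetFileReader` (`readColumnIndex`, `readOffsetIndex`, `readBloomFilter`). The surrounding class and method names are illustrative only, not the PR's actual code:

```java
import java.io.IOException;
import java.util.List;

import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

// Hypothetical helper illustrating the column-by-column alternative.
class PerColumnIndexRead {

  static void readIndexes(ParquetFileReader reader, List<ColumnChunkMetaData> columnsInOrder)
      throws IOException {
    for (ColumnChunkMetaData chunk : columnsInOrder) {
      // Each accessor seeks to a different index section of the file,
      // so every column costs up to 3 random seeks.
      ColumnIndex columnIndex = reader.readColumnIndex(chunk);
      OffsetIndex offsetIndex = reader.readOffsetIndex(chunk);
      BloomFilter bloomFilter = reader.readBloomFilter(chunk);
      // ... use the indexes; any of them may be null if absent in the file
    }
  }
}
```

Because a row group's entries within each index section are contiguous (per the layout above), batching the reads per section, as the PR's `readAllColumnIndexes` does for column indexes, replaces the 3 * numColumns seek pattern with roughly one seek per section per row group.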