[
https://issues.apache.org/jira/browse/PARQUET-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776536#comment-17776536
]
ASF GitHub Bot commented on PARQUET-2366:
-----------------------------------------
ConeyLiu commented on code in PR #1174:
URL: https://github.com/apache/parquet-mr/pull/1174#discussion_r1363403671
##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##########
@@ -265,6 +265,10 @@ private void processBlocksFromReader() throws IOException {
BlockMetaData blockMetaData = meta.getBlocks().get(blockId);
List<ColumnChunkMetaData> columnsInOrder = blockMetaData.getColumns();
+ List<ColumnIndex> columnIndexes = readAllColumnIndexes(reader, columnsInOrder, descriptorsMap);
Review Comment:
@wgtmac thanks for your suggestions. Do you mean reading the indexes column by
column to reduce the memory footprint? The suggested way should indeed use less
memory. To my understanding, the indexes are stored as follows:
```
// column index
block1_col1_column_index
...
block1_coln_column_index
block2_col1_column_index
...
block2_coln_column_index
...
// offset index
block1_col1_offset_index
...
block1_coln_offset_index
block2_col1_offset_index
...
block2_coln_offset_index
...
// bloom index
block1_col1_bloom_index
...
block1_coln_bloom_index
block2_col1_bloom_index
...
block2_coln_bloom_index
...
```
So the problem would be that we still need random seeks within a single row
group (3 * the number of columns). Async I/O should help with random-seek
performance. With this PR, we only need 3 random seeks per row group (except
for column pruning).
> Optimize random seek during rewriting
> -------------------------------------
>
> Key: PARQUET-2366
> URL: https://issues.apache.org/jira/browse/PARQUET-2366
> Project: Parquet
> Issue Type: Improvement
> Reporter: Xianyang Liu
> Priority: Major
>
> The `ColumnIndex`, `OffsetIndex`, and `BloomFilter` are stored at the end of
> the file. We need to seek randomly 4 times when rewriting a single column
> chunk. We found this can heavily impact rewrite performance for files with a
> large number of columns (~1000). In this PR, we read the `ColumnIndex`,
> `OffsetIndex`, and `BloomFilter` into a cache to avoid the random seeks. We
> saw about a 60x performance improvement in production environments for files
> with about one thousand columns.
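For illustration, a hedged sketch of the per-chunk access pattern the
description refers to (names and structure are assumptions, not the actual
ParquetRewriter code):
```
import java.io.IOException;

import org.apache.parquet.column.values.bloomfilter.BloomFilter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.internal.column.columnindex.ColumnIndex;
import org.apache.parquet.internal.column.columnindex.OffsetIndex;

final class PerChunkSeekSketch {

  private PerChunkSeekSketch() {}

  // Without a cache, every column chunk triggers three index reads near the end
  // of the file plus one read of its data pages: roughly 4 random seeks per
  // chunk, or 4 * N per row group with N columns.
  static void processRowGroup(ParquetFileReader reader, BlockMetaData block) throws IOException {
    for (ColumnChunkMetaData chunk : block.getColumns()) {
      ColumnIndex columnIndex = reader.readColumnIndex(chunk);   // seek into the column index region
      OffsetIndex offsetIndex = reader.readOffsetIndex(chunk);   // seek into the offset index region
      BloomFilter bloomFilter = reader.readBloomFilter(chunk);   // seek into the bloom filter region
      // ... a fourth seek back to this chunk's data pages would follow here
    }
  }
}
```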
--
This message was sent by Atlassian Jira
(v8.20.10#820010)