[ 
https://issues.apache.org/jira/browse/PARQUET-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2366.
------------------------------
    Fix Version/s: 1.14.0
         Assignee: Xianyang Liu
       Resolution: Fixed

> Optimize random seek during rewriting
> -------------------------------------
>
>                 Key: PARQUET-2366
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2366
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Xianyang Liu
>            Assignee: Xianyang Liu
>            Priority: Major
>             Fix For: 1.14.0
>
>
> The `ColumnIndex`, `OffsetIndex`, and `BloomFilter` are stored at the end of 
> the file, so rewriting a column chunk requires 4 random seeks. We 
> found this can heavily impact rewrite performance for files with a large 
> number of columns (~1000). In this PR, we read the `ColumnIndex`, 
> `OffsetIndex`, and `BloomFilter` into a cache to avoid the random seeks. We 
> saw about a 60x performance improvement in production environments for 
> files with about one thousand columns.
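
The caching idea described above can be illustrated with a small, self-contained sketch. This is not the code from the patch and does not use any Parquet API: the file layout, `CountingFile`, `build_file`, `rewrite_naive`, and `rewrite_cached` are all hypothetical names invented for illustration, and the toy file simply places per-column data first and the three index structures at the tail. It only shows why reading the tail region once and serving lookups from memory reduces the number of seeks.

```python
import io

# Hypothetical labels for the three structures stored at the end of the file.
KINDS = ("column_index", "offset_index", "bloom_filter")


class CountingFile:
    """Wraps a binary file object and counts every explicit seek."""

    def __init__(self, raw):
        self.raw = raw
        self.seeks = 0

    def read_at(self, offset, length):
        self.raw.seek(offset)
        self.seeks += 1
        return self.raw.read(length)


def build_file(num_columns):
    """Build a toy file: column data first, then index blobs at the tail."""
    buf = io.BytesIO()
    data_ranges = []
    for c in range(num_columns):
        payload = f"data{c}".encode()
        data_ranges.append((buf.tell(), len(payload)))
        buf.write(payload)
    index_ranges = {}
    for kind in KINDS:
        for c in range(num_columns):
            payload = f"{kind}{c}".encode()
            index_ranges[(kind, c)] = (buf.tell(), len(payload))
            buf.write(payload)
    return buf, data_ranges, index_ranges


def rewrite_naive(f, data_ranges, index_ranges):
    """Seek once for the data and once per index structure: 4 seeks per column."""
    out = []
    for c, (off, length) in enumerate(data_ranges):
        parts = [f.read_at(off, length)]
        for kind in KINDS:
            o, n = index_ranges[(kind, c)]
            parts.append(f.read_at(o, n))
        out.append(parts)
    return out


def rewrite_cached(f, data_ranges, index_ranges):
    """Read the whole index region with one seek, then slice it in memory."""
    start = min(o for o, _ in index_ranges.values())
    end = max(o + n for o, n in index_ranges.values())
    tail = f.read_at(start, end - start)  # the only seek into the tail
    out = []
    for c, (off, length) in enumerate(data_ranges):
        parts = [f.read_at(off, length)]
        for kind in KINDS:
            o, n = index_ranges[(kind, c)]
            parts.append(tail[o - start:o - start + n])
        out.append(parts)
    return out
```

In this toy model, N columns cost 4N seeks in the naive version and N + 1 in the cached version. The real speedup depends on the storage system and on how the cache is scoped, which this sketch does not attempt to model.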



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
