Jing Zhang created HUDI-7578:
--------------------------------
Summary: Avoid unnecessary rewriting when copying old data from the old
base file to the new base file to improve compaction performance
Key: HUDI-7578
URL: https://issues.apache.org/jira/browse/HUDI-7578
Project: Apache Hudi
Issue Type: Improvement
Reporter: Jing Zhang
Dear community,
After upgrading a Hudi table from version 0.10 to version 0.14, the compaction
job became much slower.
The Hudi table is a MOR table without a partition field, and it has not
undergone any schema evolution.
The compaction job finishes in 52 minutes on version 0.14.
<img width="2011" alt="image"
src="https://github.com/apache/hudi/assets/1525333/f825da15-5319-4ab2-9f0c-97741f4ea4f7">
The compaction job finishes in 25 minutes on version 0.10.
<img width="1974" alt="image"
src="https://github.com/apache/hudi/assets/1525333/223c3fc2-7991-40a0-8a86-5a949821c55e">
In version 0.14, the jstack output of the task is also much more complex. It
includes the following:
<img width="1433" alt="image"
src="https://github.com/apache/hudi/assets/1525333/9394a3b4-3074-4ba5-bd07-7c73f195085f">
After comparing versions 0.14 and 0.10, we found a difference in how old
records are copied from the old base file to the new base file.
In version 0.14, the copy is much more expensive.
<img width="1259" alt="image"
src="https://github.com/apache/hudi/assets/1525333/879b0f8e-dbc8-458b-9b45-afdced25580c">
<img width="1354" alt="image"
src="https://github.com/apache/hudi/assets/1525333/d22835b2-7d6c-44ae-aaf1-967d1622c9ae">
<img width="1438" alt="image"
src="https://github.com/apache/hudi/assets/1525333/438984f7-5d3f-4635-ae64-d3221d73cc34">
<img width="1627" alt="image"
src="https://github.com/apache/hudi/assets/1525333/e1d5ddb4-1544-4f17-b9f9-6193765c8bed">
In version 0.10, the copy is simpler.
<img width="1421" alt="image"
src="https://github.com/apache/hudi/assets/1525333/28eb2af7-e0f2-43b7-bfc7-f174e30cd944">
Rewriting every field value of each old record is not necessary. Updating the
new file path value and the metadata fields is enough.
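
The difference between the two copy strategies can be illustrated with a
minimal sketch. It is not Hudi's actual code: records are modeled as plain
dictionaries, the helper names rewrite_all_fields and update_file_name_only are
hypothetical, and it assumes the old and new schemas are identical (no schema
evolution, as in the table described above). Only the metadata column names are
taken from Hudi.

```python
# Hudi metadata column that stores the name of the base file holding the record.
FILE_NAME_FIELD = "_hoodie_file_name"


def rewrite_all_fields(old_record, schema_fields, new_file_name):
    """Costly path: build a new record by copying every field one by one."""
    new_record = {}
    for field in schema_fields:
        new_record[field] = old_record.get(field)
    new_record[FILE_NAME_FIELD] = new_file_name
    return new_record


def update_file_name_only(old_record, new_file_name):
    """Cheap path: reuse the record and overwrite only the file name field.

    Assumption: the writer schema equals the schema of the old record,
    so no other field needs to be touched.
    """
    old_record[FILE_NAME_FIELD] = new_file_name
    return old_record


# Hypothetical schema: a few Hudi metadata columns plus two data columns.
schema_fields = [
    "_hoodie_commit_time",
    "_hoodie_record_key",
    FILE_NAME_FIELD,
    "id",
    "name",
]
old = {
    "_hoodie_commit_time": "20240401000000",
    "_hoodie_record_key": "id:1",
    FILE_NAME_FIELD: "old-base.parquet",
    "id": 1,
    "name": "a",
}

# Each call gets its own copy so the two strategies do not share state.
slow = rewrite_all_fields(dict(old), schema_fields, "new-base.parquet")
fast = update_file_name_only(dict(old), "new-base.parquet")
```

Both paths yield the same record here, but the first one touches every field of
every record, which is the per-record overhead this issue proposes to avoid.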