Jing Zhang created HUDI-7578:
--------------------------------

             Summary: Avoid unnecessary rewriting when copy old data from old 
base to new base file to improve compaction performance 
                 Key: HUDI-7578
                 URL: https://issues.apache.org/jira/browse/HUDI-7578
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Jing Zhang


Dear community,
After upgrading a Hudi table from version 0.10 to version 0.14, the compaction 
job became much slower.
The Hudi table is a MOR table without a partition field, and it has not 
undergone any schema evolution.

With version 0.14, the compaction job finishes in 52 minutes.

<img width="2011" alt="image" 
src="https://github.com/apache/hudi/assets/1525333/f825da15-5319-4ab2-9f0c-97741f4ea4f7">

With version 0.10, the compaction job finishes in 25 minutes.
<img width="1974" alt="image" 
src="https://github.com/apache/hudi/assets/1525333/223c3fc2-7991-40a0-8a86-5a949821c55e">

In version 0.14, the task's jstack output is also much more complex. It 
includes the following:
<img width="1433" alt="image" 
src="https://github.com/apache/hudi/assets/1525333/9394a3b4-3074-4ba5-bd07-7c73f195085f">

After comparing versions 0.14 and 0.10, we found a difference in how old 
records are copied from the old base file to the new base file.
In version 0.14, the copy is much more expensive.
<img width="1259" alt="image" 
src="https://github.com/apache/hudi/assets/1525333/879b0f8e-dbc8-458b-9b45-afdced25580c">
<img width="1354" alt="image" 
src="https://github.com/apache/hudi/assets/1525333/d22835b2-7d6c-44ae-aaf1-967d1622c9ae">
<img width="1438" alt="image" 
src="https://github.com/apache/hudi/assets/1525333/438984f7-5d3f-4635-ae64-d3221d73cc34">
<img width="1627" alt="image" 
src="https://github.com/apache/hudi/assets/1525333/e1d5ddb4-1544-4f17-b9f9-6193765c8bed">

In version 0.10, the copy is simpler.
<img width="1421" alt="image" 
src="https://github.com/apache/hudi/assets/1525333/28eb2af7-e0f2-43b7-bfc7-f174e30cd944">

 

Rewriting every field value of each old record is not necessary; updating the 
new file path value and the metadata fields is enough.
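
To illustrate the idea, here is a minimal sketch that contrasts the two copy 
strategies. It does not use Hudi or Avro classes: a record is modeled as a plain 
Object[] and the class and method names (CopySketch, rewriteAllFields, 
updateFileNameOnly) are hypothetical. It assumes the schema is unchanged and 
that _hoodie_file_name sits at position 4 among the leading metadata fields.

```java
import java.util.Arrays;

/** Hypothetical stand-in for copying a record during compaction; not Hudi code. */
final class CopySketch {
    // Assumed position of _hoodie_file_name among the leading metadata fields.
    static final int FILE_NAME_POS = 4;

    /** Full rewrite: allocates a new record and copies every field value. */
    static Object[] rewriteAllFields(Object[] oldRecord, String newFileName) {
        Object[] copy = new Object[oldRecord.length];
        for (int i = 0; i < oldRecord.length; i++) {
            // A real rewrite may also convert each value against the target schema.
            copy[i] = oldRecord[i];
        }
        copy[FILE_NAME_POS] = newFileName;
        return copy;
    }

    /** Cheap path: the schema is unchanged, so only the file name is overwritten in place. */
    static Object[] updateFileNameOnly(Object[] oldRecord, String newFileName) {
        oldRecord[FILE_NAME_POS] = newFileName;
        return oldRecord;
    }

    public static void main(String[] args) {
        // Fields: commit time, commit seqno, record key, partition path, file name, one data field.
        Object[] record = {"20240401", "20240401_0_1", "key1", "", "old-base.parquet", 42};
        Object[] updated = updateFileNameOnly(record, "new-base.parquet");
        System.out.println(Arrays.toString(updated));
    }
}
```

The cheap path does constant work per record regardless of the number of 
columns, whereas the full rewrite touches every field. Whether the cheap path is 
safe in a given case depends on the writer and reader schemas actually matching.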



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
