harshagudladona opened a new issue, #14070:
URL: https://github.com/apache/hudi/issues/14070

   ### Bug Description
   
   **Environment:**
   
   Spark 3.5
   Java 11
   Hudi 0.14 and 1.0.2
   Storage S3
   
   **What happened:**
   We are seeing roughly 20-30% slower writes on Hudi 1.0.2 compared to 
Hudi 0.14.
   
   We profiled the executors while writing the same data with both versions, 
and here is the first of a few suspects. We will open separate bug reports 
as we find more.
   
   Small file Handling:
   
   In both versions, inside the HoodieMergeHelper.runMerge function, the 
following condition evaluates to false, even though the writer schema did not 
change during the test:
   
   ```
       // Check whether the writer schema is simply a projection of the file's one, ie
       //   - Its field-set is a proper subset (of the reader schema)
       //   - There's no schema evolution transformation necessary
       boolean isPureProjection = schemaEvolutionTransformerOpt.isEmpty()
           && isStrictProjectionOf(readerSchema, writerSchema);
   ```
   
   Because of this, every record is rewritten through the function 
rewriteRecordWithNewSchema:
   
   ```
         executor = ExecutorFactory.create(writeConfig, recordIterator, new UpdateHandler(mergeHandle), record -> {
           HoodieRecord newRecord;
           if (schemaEvolutionTransformerOpt.isPresent()) {
             newRecord = schemaEvolutionTransformerOpt.get().apply(record);
           } else if (shouldRewriteInWriterSchema) {
             newRecord = record.rewriteRecordWithNewSchema(recordSchema, writeConfig.getProps(), writerSchema);
           } else {
             newRecord = record;
           }
   ```
   Inside this call path, Hudi 1.x adds the following check to 
HoodieAvroUtils.rewriteRecordWithNewSchema. It runs for every record and 
appears to take significant CPU time:
   
   ```
       if (oldAvroSchema.equals(newSchema)) {
         // there is no need to rewrite if the schema equals.
         return oldRecord;
       }
       // 
   ```
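
One way to think about a mitigation, offered only as a sketch and not as Hudi's actual fix: because the same two schema instances are compared for every record, the deep-equality verdict could be computed once per pair of instances and then looked up by reference. The example below uses a hypothetical `FakeSchema` stand-in rather than the real Avro `Schema` class, and `SchemaPairCache` is an illustrative name that does not exist in Hudi. It is not thread-safe as written.

```java
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class SchemaEqualityCacheSketch {

  // Hypothetical stand-in for an Avro schema: equals() walks every field,
  // so its cost grows with the width of the schema.
  static final class FakeSchema {
    static long deepComparisons = 0; // counts how often the deep equals() runs
    final List<String> fields;

    FakeSchema(List<String> fields) {
      this.fields = fields;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof FakeSchema)) {
        return false;
      }
      deepComparisons++;
      return fields.equals(((FakeSchema) o).fields);
    }

    @Override
    public int hashCode() {
      return fields.hashCode();
    }
  }

  // Remembers the deep-equality result per (old, new) pair of schema
  // instances, keyed by reference so lookups never call equals().
  static final class SchemaPairCache {
    private final Map<FakeSchema, Map<FakeSchema, Boolean>> cache = new IdentityHashMap<>();

    boolean sameSchema(FakeSchema oldSchema, FakeSchema newSchema) {
      if (oldSchema == newSchema) {
        return true; // same instance, nothing to compare
      }
      return cache
          .computeIfAbsent(oldSchema, k -> new IdentityHashMap<>())
          .computeIfAbsent(newSchema, k -> oldSchema.equals(newSchema));
    }
  }

  public static void main(String[] args) {
    // Two distinct instances with identical content, as when the writer
    // schema and the file schema are parsed separately.
    FakeSchema a = new FakeSchema(List.of("id", "ts", "payload"));
    FakeSchema b = new FakeSchema(List.of("id", "ts", "payload"));
    SchemaPairCache cache = new SchemaPairCache();

    // Simulate the per-record check on a million records.
    for (int i = 0; i < 1_000_000; i++) {
      cache.sameSchema(a, b);
    }
    System.out.println("deep comparisons: " + FakeSchema.deepComparisons);
  }
}
```

With this arrangement the deep comparison runs once for the pair `(a, b)` instead of once per record. Whether a similar cache is appropriate inside Hudi depends on how schema instances are shared across records and threads, which this sketch does not model.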
   1.x flamegraph 
   
   <img width="2325" height="526" alt="Image" src="https://github.com/user-attachments/assets/4e934182-b445-40dc-be10-40e90bb9af30" />
   
   0.14 flamegraph
   
   <img width="2326" height="587" alt="Image" src="https://github.com/user-attachments/assets/fc132bef-99e1-4330-8fe6-6928cd48173d" />
   
   Hotspot in 1.0.2
   
   <img width="2332" height="642" alt="Image" src="https://github.com/user-attachments/assets/98e63658-aabb-4ad9-819b-cc05703eecc5" />
   
   
   **What you expected:**
   
   No performance degradation
   
   **Steps to reproduce:**
   1. Write any dataset in 0.14 and 1.0.2 and observe the difference.
   
   ### Environment
   
   **Hudi version: 0.14 and 1.0.2**
   **Query engine: Spark**
   **Relevant configs: Same configurations as the one reported here: 
https://github.com/apache/hudi/issues/13995**
   
   
   ### Logs and Stack Trace
   
   _No response_

