weimingdiit commented on PR #9313:
URL: https://github.com/apache/hudi/pull/9313#issuecomment-1656990702

   My test report
   
   The data volume, memory settings, and GC parameters are identical across both runs.
   
   Before optimization: GC accounts for 33.52% of the overall sampling, and the Spark write stage takes 14 min.
   
   Flame Graph:
   
![20230727-154250](https://github.com/apache/hudi/assets/23093701/3ea3565a-546f-435f-b8fb-a8b6c4b00431)
   sparkUI:
   
![before](https://github.com/apache/hudi/assets/23093701/56bdf319-ac22-4ebf-9037-8661f45543d4)
   
   After optimization: GC accounts for 9.5% of the overall sampling, and the Spark write stage takes 9.3 min.
   Flame Graph:
   
![20230729-185852](https://github.com/apache/hudi/assets/23093701/ee981327-ffa3-4d20-9018-dff24b047727)
   sparkUI:
   
![after](https://github.com/apache/hudi/assets/23093701/022faacf-52e0-41a9-bd9b-ddeae24858f4)
   
   Summary:
   With a large data volume, GC overhead drops by roughly 24 percentage points (33.52% → 9.5%).
   
   Note:
   We found that the previous code was implemented using Hex.encodeHex, and it is not clear why it was changed to the String.format method. Perhaps because of "No longer depends on incl commons-codec, commons-io, commons-pool, commons-dbcp, commons-lang, commons-logging, avro-mapred"?
   See:
   https://github.com/apache/hudi/pull/873
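
   To illustrate the kind of change being discussed, here is a minimal sketch (class and method names are my own, not the actual PR code) contrasting the two hex-encoding styles. The lookup-table version is similar in spirit to Hex.encodeHex but needs no commons-codec dependency, while the String.format version allocates a format-parse plus a fresh String per byte, which is where the extra garbage on hot write paths comes from.

   ```java
   import java.nio.charset.StandardCharsets;

   public class HexDemo {
       private static final char[] HEX = "0123456789abcdef".toCharArray();

       // Allocation-light encoding: one char[] and one String for the
       // whole input, regardless of its length.
       static String toHexFast(byte[] bytes) {
           char[] out = new char[bytes.length * 2];
           for (int i = 0; i < bytes.length; i++) {
               int v = bytes[i] & 0xFF;          // treat byte as unsigned
               out[i * 2] = HEX[v >>> 4];        // high nibble
               out[i * 2 + 1] = HEX[v & 0x0F];   // low nibble
           }
           return new String(out);
       }

       // String.format style: parses the "%02x" pattern and allocates an
       // intermediate String for every single byte.
       static String toHexSlow(byte[] bytes) {
           StringBuilder sb = new StringBuilder(bytes.length * 2);
           for (byte b : bytes) {
               sb.append(String.format("%02x", b));
           }
           return sb.toString();
       }

       public static void main(String[] args) {
           byte[] data = "hudi".getBytes(StandardCharsets.UTF_8);
           System.out.println(toHexFast(data)); // 68756469
           System.out.println(toHexSlow(data)); // 68756469
       }
   }
   ```

   Both methods produce identical output; only the per-byte allocation behavior differs, which matches the GC pattern seen in the flame graphs above.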
   
   

