venkee14 opened a new issue #1763:
URL: https://github.com/apache/hudi/issues/1763


   I have noticed that the individual job runtimes shown in the Spark UI do 
not add up to the total upsert time. I am trying to understand where the 
extra time is spent so that I can reduce it and make the upsert run faster.
   
   We recently increased hoodie.cleaner.commits.retained for this table to a 
higher value (250). Could that be the cause? We may want to increase this 
number even further, since we want to be able to run incremental queries 
going back a few weeks. We do a batch upsert into the Hudi table every 
10 minutes.
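
   To put the retention setting in perspective, here is a rough sketch of the arithmetic. It assumes exactly one commit per 10-minute batch and nothing else on the timeline. The 14-day target is only an illustrative stand-in for "a few weeks", and the helper names are hypothetical, not part of Hudi.

```python
import math

# Assumption: one commit per upsert batch, one batch every 10 minutes.
BATCH_INTERVAL_MIN = 10


def retained_window_hours(commits_retained, interval_min=BATCH_INTERVAL_MIN):
    """Approximate look-back window covered by the retained commits, in hours."""
    return commits_retained * interval_min / 60


def commits_needed(days, interval_min=BATCH_INTERVAL_MIN):
    """Approximate number of commits to retain to cover `days` of history."""
    minutes = days * 24 * 60
    return math.ceil(minutes / interval_min)


if __name__ == "__main__":
    # Current setting from this issue: hoodie.cleaner.commits.retained=250
    print(f"250 commits cover about {retained_window_hours(250):.1f} hours")
    # Hypothetical target: two weeks of incremental-query history
    print(f"14 days would need about {commits_needed(14)} commits")
```

Under these assumptions, 250 retained commits cover well under two days, so a look-back of several weeks would need a setting in the thousands.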
   
   Spark UI shows total Uptime - 7.6 min
   Upsert Time from logs - 20/06/24 01:32:51 INFO metrics: type=GAUGE, 
name=AR_PAYMENT_SCHEDULES_ALL.commit.totalUpsertTime, value=488623
   Individual Job times added together - ~3.4 min
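
   A minimal sketch of the gap between these numbers, using only the figures quoted above. It assumes the totalUpsertTime gauge is reported in milliseconds, which I have not confirmed. The variable names are just for illustration.

```python
# Figures copied from this issue.
SPARK_UI_UPTIME_MIN = 7.6          # total uptime shown in the Spark UI
SUM_OF_JOB_TIMES_MIN = 3.4         # approximate sum of individual job times
TOTAL_UPSERT_TIME_GAUGE = 488623   # commit.totalUpsertTime gauge value

# Time not covered by any Spark job, according to the Spark UI.
unaccounted_min = SPARK_UI_UPTIME_MIN - SUM_OF_JOB_TIMES_MIN

# Assumption: the gauge is in milliseconds.
gauge_min = TOTAL_UPSERT_TIME_GAUGE / 1000 / 60

if __name__ == "__main__":
    print(f"Unaccounted time: {unaccounted_min:.1f} min")
    print(f"Gauge value as minutes (if ms): {gauge_min:.1f} min")
```

If the millisecond assumption holds, the gauge is also larger than the Spark UI uptime, so it may be measured differently from wall-clock time.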
   
   Env:
   
   EMR Version - 5.28
   Hudi Version - 0.5.1
   Spark Version - 2.2.1
   
   I have attached the upsert job log and a Spark UI screenshot.
   [Uploading logs.txt…]()
   <img width="1136" alt="Screen Shot 2020-06-23 at 7 21 03 PM" 
src="https://user-images.githubusercontent.com/12746240/85494664-f560cf00-b58d-11ea-92ff-820393e84216.png">
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

