venkee14 opened a new issue #1763: URL: https://github.com/apache/hudi/issues/1763
I have noticed that the individual job runtimes shown in the Spark UI do not add up to the total upsert time. I am trying to understand where the extra time is spent so that I can reduce it and make the upsert run faster.

We recently increased hoodie.cleaner.commits.retained for this table to 250. Could that be the cause? We may want to increase this number even further, since we want to be able to run incremental queries going back a few weeks. We do a batch upsert into the Hudi table every 10 minutes.

Spark UI total uptime - 7.6 min
Upsert time from logs - 20/06/24 01:32:51 INFO metrics: type=GAUGE, name=AR_PAYMENT_SCHEDULES_ALL.commit.totalUpsertTime, value=488623
Individual job times added together - ~3.4 min

Env:
EMR Version - 5.28
Hudi Version - 0.5.1
Spark Version - 2.2.1

I have attached the upsert job log and a Spark UI screenshot.

[Uploading logs.txt…]()

<img width="1136" alt="Screen Shot 2020-06-23 at 7 21 03 PM" src="https://user-images.githubusercontent.com/12746240/85494664-f560cf00-b58d-11ea-92ff-820393e84216.png">

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [email protected]
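To make the gap in the reported numbers concrete, here is a small sketch of the arithmetic. It assumes the `totalUpsertTime` gauge is reported in milliseconds, which the log line does not state; the variable names are illustrative only.

```python
# Figures quoted in the issue above.
# Assumption: the totalUpsertTime gauge value is in milliseconds.
total_upsert_ms = 488623
spark_uptime_min = 7.6
summed_jobs_min = 3.4

# Convert the gauge to minutes so all three figures share a unit.
total_upsert_min = total_upsert_ms / 60_000

# Time not explained by the individual Spark jobs, measured two ways.
gap_vs_uptime_min = spark_uptime_min - summed_jobs_min
gap_vs_metric_min = total_upsert_min - summed_jobs_min

print(f"totalUpsertTime: {total_upsert_min:.2f} min")
print(f"uptime minus summed jobs: {gap_vs_uptime_min:.2f} min")
print(f"totalUpsertTime minus summed jobs: {gap_vs_metric_min:.2f} min")
```

Under that unit assumption, more than half of the upsert time falls outside the jobs listed in the Spark UI, which is the portion the question is about.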
