bvaradar commented on issue #2240: URL: https://github.com/apache/hudi/issues/2240#issuecomment-730037189
Just looking at the timestamps for the last compaction, clean, and delta commit operations:

Compaction:
2020-11-17 16:36:41 5097604 20201117163409.commit
2020-11-17 16:34:21 0 20201117163409.compaction.inflight
This means that compaction took around 2.3 mins to finish.

Ingestion:
2020-11-17 16:34:02 5274215 20201117162434.deltacommit
2020-11-17 16:30:26 3916496 20201117162434.deltacommit.inflight
2020-11-17 16:26:00 0 20201117162434.deltacommit.requested
The data write part alone took ~8 mins.

Cleaning:
2020-11-17 16:49:44 1798629 20201117162434.clean
2020-11-17 16:26:53 1665412 20201117162434.clean.inflight
2020-11-17 16:26:52 1665412 20201117162434.clean.requested
This seems to have taken about 23 mins.

With a single partition, the listing performance should be similar for ingestion and cleaning. If cleaning was run asynchronously, you might need to give it more executors so the clean can run in parallel. I am also wondering if the deletes are getting throttled in your case, which would slow down the clean. Can you take a look at the executor logs to see whether each file deletion is taking a long time? You can also try disabling automatic cleaning ("hoodie.clean.automatic=false") and running it every N writes if the slowdown is not due to delete throttling.
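For anyone who wants to redo the arithmetic on their own timeline listing, here is a minimal sketch that computes the elapsed time per operation from the file timestamps quoted above. It assumes the earliest marker file (`.requested` or `.inflight`) marks the start and the completed file marks the end. The `operations` dict and `elapsed_seconds` helper are illustrative names, not part of Hudi.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (earliest marker file timestamp, completed file timestamp), copied from the listing
operations = {
    "compaction": ("2020-11-17 16:34:21", "2020-11-17 16:36:41"),
    "deltacommit": ("2020-11-17 16:26:00", "2020-11-17 16:34:02"),
    "clean": ("2020-11-17 16:26:52", "2020-11-17 16:49:44"),
}


def elapsed_seconds(start: str, end: str) -> int:
    """Return whole seconds between two timestamps in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


durations = {name: elapsed_seconds(s, e) for name, (s, e) in operations.items()}

for name, secs in durations.items():
    print(f"{name}: {secs // 60} min {secs % 60} s")
```

On these timestamps the compaction comes out at 2 min 20 s, the delta commit at 8 min 2 s, and the clean at 22 min 52 s, so the clean is clearly the outlier.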
