joshhamann opened a new issue, #10822: URL: https://github.com/apache/hudi/issues/10822
**Describe the problem you faced**

We have a production transform job using AWS Glue version 4.0 and Hudi version 0.12.1 that loads data into a Hudi table on S3. At some point, this job started taking longer to run. I created a test job that points to the same raw data source and loads into a new Hudi table on S3. It completed much faster (5 min vs 15 min), in line with expectation. We are partitioning by date, and the volume of data has not changed. The job runs every 15 minutes, so the job duration is now becoming an issue.

I noticed there are many files in these S3 locations for the production transform:

.hoodie/metadata/.hoodie
.hoodie/archive

I also noticed that both the production and test jobs seem to transform the data in the same amount of time (~5 minutes), but the production job then has many additional steps after `DirectWriteMarkers`, which account for the rest of the time difference. These steps are:

FSUtils
CleanPlanActionExecutor
SparkUpsertDeltaCommitPartitioner
SparkUpsertPreppedDeltaCommitActionExecutor

Test Job: [screenshot not preserved]

Production Job with many more steps at the end: [screenshot not preserved]

**To Reproduce**

1. Run a job every 15 minutes for a long time, and let metadata/timeline/archive files build up
2. Create a new test job on the same source data (which processes MORE data, since it does not use bookmarks)
3. Notice that the test job finishes quicker than the actual production job

**Expected behavior**

Given the structure of the Hudi table, I expect that continually building up more days of data should not slow Hudi down. I also expect there to be some configs to assist in cleanup. What configs can I set to alleviate the extra steps at the end that I am experiencing in production?

**Environment Description**

AWS Glue 4.0, Hudi 0.12.1, Spark 3.3.0

* Hudi version : 0.12.1
* Spark version : 3.3.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No

-- This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
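
On the question of cleanup configs raised in the issue, the following is a minimal sketch of how the cleaner and timeline-archival settings might be assembled for a Hudi write. The helper name `hudi_cleanup_options` and the numeric values are hypothetical and chosen only for illustration; the `hoodie.*` keys are standard Hudi write configs, but whether tuning them resolves this particular slowdown is not established by the report.

```python
def hudi_cleanup_options(commits_retained=5, min_commits=10, max_commits=15):
    """Build a dict of Hudi cleaner/archival options (hypothetical helper).

    The values are illustrative assumptions, not recommendations.
    Hudi expects the retained-commit count to stay below the archival
    minimum, and the archival minimum to stay below the archival maximum,
    so that ordering is validated here before the options are returned.
    """
    if not (0 < commits_retained < min_commits < max_commits):
        raise ValueError(
            "expected 0 < commits_retained < min_commits < max_commits"
        )
    return {
        # Run the cleaner as part of each write.
        "hoodie.clean.automatic": "true",
        # Keep file versions needed by the last N commits.
        "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
        "hoodie.cleaner.commits.retained": str(commits_retained),
        # Bounds on how many commits stay in the active timeline
        # before older ones are moved to .hoodie/archive.
        "hoodie.keep.min.commits": str(min_commits),
        "hoodie.keep.max.commits": str(max_commits),
    }


opts = hudi_cleanup_options()
# In a Spark job these could be merged into the existing writer options,
# e.g. df.write.format("hudi").options(**opts) alongside the table settings.
for key in sorted(opts):
    print(f"{key}={opts[key]}")
```

Smaller retention and archival bounds keep the active timeline shorter at the cost of a shorter window for incremental queries and rollback, so any values would need to be checked against how the table is actually read.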
