joshhamann opened a new issue, #10822:
URL: https://github.com/apache/hudi/issues/10822

   **Describe the problem you faced**
   We have a production transform job on AWS Glue 4.0 with Hudi 0.12.1 that 
loads data into a Hudi table on S3.  At some point, this job started taking 
longer to run.  I created a test job that reads the same raw data source and 
loads into a new Hudi table on S3, and it completed much faster (5 min vs 
15 min), in line with expectations.  We partition by date, and the volume of 
data has not changed.  The job runs every 15 minutes, so the job duration is 
now becoming an issue.  I noticed there are many files in these S3 locations 
for the production table:
   
   .hoodie/metadata/.hoodie
   .hoodie/archive
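
To put a number on "many files", the object keys under each of these locations can be tallied. This is a minimal sketch, assuming the keys have already been listed (for example with a boto3 `list_objects_v2` paginator) and are relative to the table root; `count_keys_by_prefix` and the sample keys are hypothetical:

```python
from collections import Counter


def count_keys_by_prefix(keys, prefixes):
    """Count how many object keys fall under each of the given prefixes."""
    counts = Counter({p: 0 for p in prefixes})
    for key in keys:
        for p in prefixes:
            if key.startswith(p):
                counts[p] += 1
                break  # a key is counted under at most one prefix
    return dict(counts)


# Hypothetical sample keys, relative to the table root
sample_keys = [
    ".hoodie/archive/file-a",
    ".hoodie/archive/file-b",
    ".hoodie/metadata/.hoodie/file-c",
    "2024-03-05/data-file.parquet",
]
prefixes = [".hoodie/metadata/.hoodie", ".hoodie/archive"]
print(count_keys_by_prefix(sample_keys, prefixes))
```

Running the same tally against the production and test tables would show whether the file counts differ in the way the timings suggest.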
   
   I also noticed that the production and test jobs both seem to transform the 
data in the same amount of time (~5 minutes), but the production job then runs 
many additional steps after `DirectWriteMarkers`, which account for the rest of 
the time difference.  These steps are:
   
   FSUtils
   CleanPlanActionExecutor
   SparkUpsertDeltaCommitPartitioner
   SparkUpsertPreppedDeltaCommitActionExecutor
   
   Test Job:
   ![Screenshot 2024-03-05 at 9 46 53 AM](https://github.com/apache/hudi/assets/12532529/d953850e-713c-4ccf-b59d-671e187cf709)
   
   Production Job, with many more steps at the end:
   ![Screenshot 2024-03-05 at 9 47 22 AM](https://github.com/apache/hudi/assets/12532529/2d0c46cf-c40f-4558-ae8a-4ef2a8683425)
   
   **To Reproduce**
   
   1. Run a job every 15 minutes for a long time, and let 
metadata/timeline/archive files build up
   2. Create a new test job on the same source data (which processes MORE data, 
given that it is not using bookmarks)
   3. Notice that the test job finishes quicker than the actual production job
   
   **Expected behavior**
   
   I expect that, given the structure of the Hudi table, continually adding 
more days of data should not slow Hudi down.  I also expect there to be some 
configs that assist with cleanup.  What configs can I set to reduce these 
extra steps at the end that I am seeing in production?
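
A minimal sketch of the kind of write options this question is about, assuming Hudi's cleaner and timeline-archival settings are the relevant ones here. The values are illustrative assumptions, not what production runs and not tuned recommendations; `build_cleanup_options`, `df`, and `table_path` are hypothetical names:

```python
def build_cleanup_options(retained=10, keep_min=20, keep_max=30):
    """Return Hudi write options that control cleaning and timeline archival.

    The numbers are illustrative assumptions, not tuned recommendations.
    """
    # Hudi expects: commits retained by the cleaner < keep.min < keep.max
    if not retained < keep_min < keep_max:
        raise ValueError("need retained < keep_min < keep_max")
    return {
        "hoodie.clean.automatic": "true",
        "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
        "hoodie.cleaner.commits.retained": str(retained),
        # Archival trims the active timeline back to keep.min commits
        # once it grows past keep.max commits
        "hoodie.keep.min.commits": str(keep_min),
        "hoodie.keep.max.commits": str(keep_max),
    }


opts = build_cleanup_options()
# Hypothetical usage in the Glue job (df and table_path are placeholders):
# df.write.format("hudi").options(**opts).mode("append").save(table_path)
```

Whether these settings actually shorten the steps after `DirectWriteMarkers` is the open question; the sketch only shows where such options would be passed.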
   
   **Environment Description**
   AWS Glue 4.0, Hudi 0.12.1, Spark 3.3.0
   
   * Hudi version :
   0.12.1
   
   * Spark version :
   3.3.0
   
   * Storage (HDFS/S3/GCS..) :
   S3
   
   * Running on Docker? (yes/no) :
   No
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
