Haitham Eltaweel created HUDI-7332:
--------------------------------------

             Summary: The best way to force cleaning hoodie metadata
                 Key: HUDI-7332
                 URL: https://issues.apache.org/jira/browse/HUDI-7332
             Project: Apache Hudi
          Issue Type: Bug
          Components: cleaning, hudi-utilities, metadata
         Environment: Environment Description

Hudi version : 0.11.0

Spark version : 3.2.1

Amazon EMR : emr-6.11.1

Hadoop version : 3.2.1

Hive : 3.1.3

Storage (HDFS/S3/GCS..) : S3

Running on Docker? (yes/no) : No, yarn.
            Reporter: Haitham Eltaweel


We have a Spark Structured Streaming job writing data to Hudi tables. After an 
upgrade to Hudi 0.11, we found thousands of files under the hoodie metadata 
directory that were not archived. This impacts the overall processing of the 
streaming job. I found a similar issue in 
https://github.com/apache/hudi/issues/7472, where it was mentioned that this was 
fixed in 0.13. Since we have the issue in Prod, we will not be able to upgrade 
to 0.13 for now. I found that I can run a separate spark-submit job to execute 
HoodieCleaner. I also found that deleting the hudi metadata from hudi-cli could 
be an option, but I am not sure if it is safe to use that approach, as we are 
using the upsert Hudi operation in the streaming job.

Please advise on the best way to force cleaning and archiving of the metadata 
files.
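
For reference, a minimal sketch of the standalone HoodieCleaner spark-submit job 
mentioned above; the bundle jar name, S3 path, and cleaner settings are 
placeholders for illustration, not values taken from our environment:

```shell
# Sketch: run HoodieCleaner as a separate spark-submit job (Hudi 0.11).
# Jar version, table path, and retention values below are assumptions --
# adjust to match the actual deployment.
spark-submit \
  --class org.apache.hudi.utilities.HoodieCleaner \
  hudi-utilities-bundle_2.12-0.11.0.jar \
  --target-base-path s3://my-bucket/path/to/hudi_table \
  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
  --hoodie-conf hoodie.cleaner.commits.retained=10
```

This only triggers the cleaner for the given table path; it does not by itself 
change the archival behavior of the timeline.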



--
This message was sent by Atlassian Jira
(v8.20.10#820010)