Haitham Eltaweel created HUDI-7332:
--------------------------------------
Summary: The best way to force cleaning hoodie metadata
Key: HUDI-7332
URL: https://issues.apache.org/jira/browse/HUDI-7332
Project: Apache Hudi
Issue Type: Bug
Components: cleaning, hudi-utilities, metadata
Environment:
Hudi version : 0.11.0
Spark version : 3.2.1
Amazon EMR : emr-6.11.1
Hadoop version : 3.2.1
Hive : 3.1.3
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : No, running on YARN.
Reporter: Haitham Eltaweel
We have a Spark Structured Streaming job writing data to Hudi tables. After
upgrading to Hudi 0.11, we found thousands of files under the hoodie metadata
directory that were never archived, which degrades the overall throughput of
the streaming job. I found a similar issue in
https://github.com/apache/hudi/issues/7472, where it was mentioned that this
was fixed in 0.13. Since we hit the issue in Prod, we cannot upgrade to 0.13
for now. I found that I can run a separate spark-submit job to execute
HoodieCleaner. I also found that deleting the Hudi metadata from hudi-cli
could be an option, but I am not sure whether that approach is safe, since the
streaming job uses the upsert operation.
Please advise on the best way to force cleaning and archiving the metadata
files.
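For reference, the standalone cleaner job mentioned above can be invoked
roughly as follows. This is only a sketch: the bundle path, S3 path, and
retention values are placeholders, and the flag names should be verified
against the hudi-utilities bundle shipped with your Hudi version.

```shell
# Sketch of running HoodieCleaner as a separate spark-submit job.
# Assumptions: hudi-utilities bundle jar path and table base path are
# placeholders; cleaner policy/retention values are examples only.
spark-submit \
  --master yarn \
  --class org.apache.hudi.utilities.HoodieCleaner \
  /path/to/hudi-utilities-bundle_2.12-0.11.0.jar \
  --target-base-path s3://my-bucket/path/to/hudi-table \
  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
  --hoodie-conf hoodie.cleaner.commits.retained=10
```

Since the streaming job uses upserts, it would be safest to run this while no
writer is active, or to rely on the table's own lock/concurrency settings.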
--
This message was sent by Atlassian Jira
(v8.20.10#820010)