boneanxs opened a new issue, #6938: URL: https://github.com/apache/hudi/issues/6938
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

Say we have a Spark streaming job writing data to a Hudi table with `hoodie.clean.automatic` disabled, so that heavy clean operations do not slow down the write path, while a separate clustering job compacts old small files and optimizes the file layout. We also run another independent clean job (using the `KEEP_LATEST_FILE_VERSIONS` policy to clean old file versions). However, because the replacecommit can be archived by the streaming writer's archiver, the clean job sometimes cannot find any old file versions to remove, and a duplicates issue occurs.

**To Reproduce**

Steps to reproduce the behavior:

1. Run a streaming job with `hoodie.clean.automatic` set to `false`, `hoodie.keep.min.commits` set to 2, and `hoodie.keep.max.commits` set to 3 to reproduce the issue quickly.
2. Start a clustering job to cluster some partitions.
3. Wait for the streaming job to write enough commits to trigger the archiver; the replacecommits will be archived along with them.
4. The independent clean job will not clean the files that were replaced by the clustering job, because there is no corresponding replacecommit left in the active timeline.
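For reference, a minimal sketch of the write and clean configuration described in the reproduction steps. The property names come from Hudi's configuration reference; the retained-version count is illustrative, not from the report:

```properties
# Disable inline cleaning so the clean service does not block the streaming writer
hoodie.clean.automatic=false

# Very small archive window, only to reproduce the issue quickly
hoodie.keep.min.commits=2
hoodie.keep.max.commits=3

# Policy used by the independent clean job
hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
# Illustrative value: keep only the latest version of each file group
hoodie.cleaner.fileversions.retained=1
```

With a window this small, the archiver trims the active timeline after only a few commits, so the replacecommit produced by the clustering job is archived before the independent clean job runs.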
**Expected behavior**

It is sometimes necessary to disable `hoodie.clean.automatic` to keep the streaming writer stable (the clean action can use a lot of driver memory to build the file system view, especially when there are many small files and the `KEEP_LATEST_FILE_VERSIONS` policy is needed). Perhaps the archiver should be smarter about deciding whether a replacecommit can be archived, e.g. by checking whether the replaced file IDs still exist in the table: if the original replaced files are still present, no clean action has deleted the old file versions yet, so the replacecommit should be retained.

**Environment Description**

* Hudi version : master
* Spark version : 3.1.2
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) :
* Running on Docker? (yes/no) :

**Additional context**

Add any other context about the problem here.

**Stacktrace**

```Add the stacktrace of the error.```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
