zhangyue19921010 edited a comment on pull request #4078:
URL: https://github.com/apache/hudi/pull/4078#issuecomment-992236967


   Hi @yihua Thanks a lot for your attention.
   
   I agree with you: deleting archived files does lose historical instant 
information and affects some Hudi tools such as `HoodieRepairTool`.
   The current implementation is simple, but it solves most of the problems, 
at least in my experience.
   
   At present, using Hudi still involves the user's own judgment, for example:
   1. How to configure the cleaner to delete data; deleting data affects time 
travel.
   2. How to configure instant archiving; once an instant is archived, it is no 
longer available in the active timeline.
   
   Users sometimes need a clear understanding of their configuration: capping 
the number of archive files loses historical instant information, just as the 
cleaner limits time travel. **(Of course we need to remind users of this in the 
documentation.)**
   
   Fortunately, users can adopt this feature according to their own 
circumstances. If they need to keep all instant information, they can simply 
disable it. If they do not care about instants once archived, they can turn it 
on and keep a smaller value.
   
   On the other hand, I think some loss of information is inevitable; we 
cannot keep all the data forever. The questions are when and how to discard it.
   
   Of course, the improvements you mentioned are very reasonable, such as 
having Hudi implement an append function for archive files on DFS 
implementations that do not support append.
   
   Do you think we need to get this done in this PR, or can we move step by 
step toward the final state?
   By the way, we need to watch out for performance issues, which are very 
important for streaming jobs:
   
   (1) Rewriting the archived timeline content into a smaller number of files 
--> leads to write amplification on the archive files.
   (2) When deleting archived files, making sure the table no longer has any 
base or log files from the contained instants, so there is essentially no loss 
of table state --> this may require listing and collecting all the table's 
data file names, which is heavy for a large Hudi table.
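   To make the cost of option (2) concrete, here is a minimal sketch (the class 
and method names are mine, and the file-name parsing is simplified; Hudi's own 
`FSUtils` handles this properly): after listing every data file, the commit 
time embedded in each name is compared against the newest archived instant we 
intend to delete. The full listing itself is the expensive part on a large 
table.
   
   ```java
   import java.util.Arrays;
   import java.util.List;
   
   public class ArchivedInstantCheck {
   
     // Extract the commit instant time embedded in a Hudi base-file name,
     // e.g. "fileId_0-1-0_20211201000000.parquet" -> "20211201000000".
     // Illustrative parsing only; real code should use Hudi's FSUtils.
     static String commitTimeOf(String baseFileName) {
       String[] parts = baseFileName.split("_");
       String last = parts[parts.length - 1];
       return last.substring(0, last.indexOf('.'));
     }
   
     // True if any data file still belongs to an instant at or before the
     // newest archived instant slated for deletion; in that case, deleting
     // the archive files would lose state the table still references.
     static boolean anyFileReferencesArchived(List<String> dataFileNames,
                                              String maxArchivedInstant) {
       return dataFileNames.stream()
           .anyMatch(f -> commitTimeOf(f).compareTo(maxArchivedInstant) <= 0);
     }
   
     public static void main(String[] args) {
       List<String> files = Arrays.asList(
           "f1_0-1-0_20211201000000.parquet",
           "f2_0-1-0_20211213091530.parquet");
       // f1 was written at an instant older than the archived cutoff,
       // so the check flags it.
       System.out.println(anyFileReferencesArchived(files, "20211205000000"));
     }
   }
   ```
   
   Instant times sort lexicographically because they are fixed-width 
timestamps, which is why a plain string comparison works here.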

