maheshguptags commented on issue #7589:
URL: https://github.com/apache/hudi/issues/7589#issuecomment-1370489489

   Hi @yihua, thanks for looking into this. You are partially right, but I 
want to preserve every clustered file, from the first clustered file through 
to the very end of the pipeline.
   
   Let me give you an example.
   
   Step 1 image: it contains the 3 commits of the file.
   
![image](https://user-images.githubusercontent.com/115445723/210486056-13791e19-9caa-4a8d-abba-85a627ef38f2.png)
   
   Step 2 image: it contains the files after clustering.
   
![image](https://user-images.githubusercontent.com/115445723/210486422-0b5c03ce-1179-4e0b-b815-3fadd1b41bf0.png)
   
   
   Step 3 image: it contains only the clustered file and the latest commit files.
   
![image](https://user-images.githubusercontent.com/115445723/210486522-8ac6d6d7-868f-42a5-bae7-f52825e47f41.png)
   
   Step 4 image: I inserted a few more commits and then performed clustering.
   
![image](https://user-images.githubusercontent.com/115445723/210486823-f52715e0-5908-47a7-a586-325803df0edc.png)
   
   Step 5 image: this time both clustering and cleaning are triggered (4 
commits completed), so the cleaner removes the previous clustered file (exactly 
8.4 MB) and a new clustered file containing all the updated data is created. 
What I want instead is to preserve the old clustered file (8.4 MB) and also 
create the new clustered file. That way I can maintain the historical data of 
my process.
   
![image](https://user-images.githubusercontent.com/115445723/210487314-7089f256-d755-4f5e-8a0e-f12759e2d821.png)
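
   To make step 5 concrete, here is a rough sketch of the kind of Hudi writer 
options that would produce the behaviour described above: inline clustering 
every few commits, plus an automatic cleaner that only retains the most recent 
commits. The values are illustrative assumptions (chosen to line up with the 
"4 commits" above), not the actual job configuration, and `df` and `base_path` 
in the usage comment are hypothetical names.

```python
# Illustrative Hudi writer options. The keys are standard Hudi configs;
# the values are assumptions for this example, not the real job settings.
hudi_options = {
    # Run clustering inline as part of the write, once enough commits exist.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # Let the cleaner run automatically after commits.
    "hoodie.clean.automatic": "true",
    # Retain file versions needed by the last N commits; older versions,
    # including a previously clustered file, become eligible for cleaning.
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "3",
}

# Typical use with a Spark DataFrame (hypothetical `df` and `base_path`):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

   With settings of this shape, raising `hoodie.cleaner.commits.retained` only 
delays when the old clustered file is removed; it does not keep it 
indefinitely, which is the part I am asking about.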
   
   I hope this explains the use case. If not, we can have a quick call.
   Let me know your thoughts on this.
   Thanks!
    


