kbuci opened a new issue, #17903:
URL: https://github.com/apache/hudi/issues/17903

   ### Feature Description
   
   **What the feature achieves:**
   Add clustering configs that prevent `scheduleClustering` from creating a new 
plan if:
   - There are too many unarchived/uncompacted instants in the data table
   - There are too many uncompacted instants in the metadata table (MDT)
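
The gating described above could be sketched as a pure check that runs before a clustering plan is generated. This is only an illustration under stated assumptions: the class `ClusteringScheduleGuard`, its two threshold parameters, and the convention that a non-positive threshold disables a check are all hypothetical and are not existing Hudi configs or APIs. Instant counts are passed in as plain integers rather than read from a Hudi timeline.

```java
import java.util.Optional;

/**
 * Hypothetical guard for clustering scheduling.
 * All names and thresholds here are illustrative assumptions, not Hudi APIs.
 */
final class ClusteringScheduleGuard {
    // Assumed limit on unarchived/uncompacted instants in the data table.
    private final int maxDataTableInstants;
    // Assumed limit on uncompacted instants in the metadata table (MDT).
    private final int maxMdtUncompactedInstants;

    ClusteringScheduleGuard(int maxDataTableInstants, int maxMdtUncompactedInstants) {
        this.maxDataTableInstants = maxDataTableInstants;
        this.maxMdtUncompactedInstants = maxMdtUncompactedInstants;
    }

    /** Returns a reason to skip scheduling, or empty if scheduling may proceed. */
    Optional<String> skipReason(int dataTableInstants, int mdtUncompactedInstants) {
        // A non-positive threshold means the check is disabled (assumed convention).
        if (maxDataTableInstants > 0 && dataTableInstants > maxDataTableInstants) {
            return Optional.of("data table has " + dataTableInstants
                + " pending instants, limit is " + maxDataTableInstants);
        }
        if (maxMdtUncompactedInstants > 0
                && mdtUncompactedInstants > maxMdtUncompactedInstants) {
            return Optional.of("metadata table has " + mdtUncompactedInstants
                + " uncompacted instants, limit is " + maxMdtUncompactedInstants);
        }
        return Optional.empty();
    }

    /** Convenience wrapper: true when no limit is exceeded. */
    boolean shouldSchedule(int dataTableInstants, int mdtUncompactedInstants) {
        return !skipReason(dataTableInstants, mdtUncompactedInstants).isPresent();
    }
}
```

Returning a reason instead of a bare boolean would let the caller log why no plan was created, which may help operators tell a skipped schedule apart from a table with nothing to cluster.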
   
   **Why this feature is needed:**
   - If too many writes accumulate on a dataset before clean and archival run 
again, the internal timeline may grow to thousands of instants. Additionally, 
clustering leaves replaced, uncleaned older file groups in the dataset 
partitions, potentially leading to a buildup of uncleaned files.
   - Hudi datasets need to undergo compaction on the MDT once enough writes 
have accumulated. Any write or table service operation on the data table 
causes a write on the metadata table. Delays in compaction cause all writers 
to take more time building their internal filesystem view when reading the 
metadata table, because they have to process all uncompacted files. There is 
also an indirect impact on lock contention: clustering and write operations 
need to create a filesystem view while the lock is held (in 
`org.apache.hudi.client.BaseHoodieTableServiceClient#scheduleTableServiceInternal`). 
This increases the time the lock is held and can cause other concurrent 
writers to be delayed or to fail while waiting for the table lock.
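
One way to quantify "uncompacted instants" on the MDT is to count the delta commits completed since the most recent compaction. The sketch below uses a simplified model and is not Hudi code: it assumes the completed timeline is available as a list of action names in chronological order, that a completed compaction appears as a `commit` action, and that ordinary MDT writes appear as `deltacommit`. The class `UncompactedInstantCounter` is hypothetical.

```java
import java.util.List;

/** Hypothetical helper working on a simplified timeline model, not a Hudi API. */
final class UncompactedInstantCounter {
    private UncompactedInstantCounter() {}

    /**
     * Counts "deltacommit" entries after the last "commit" entry.
     * The list is assumed to be in chronological order, oldest first.
     */
    static int countSinceLastCompaction(List<String> completedActions) {
        int count = 0;
        // Walk backwards from the newest instant until a compaction is found.
        for (int i = completedActions.size() - 1; i >= 0; i--) {
            String action = completedActions.get(i);
            if ("commit".equals(action)) {
                break; // everything older than this is already compacted
            }
            if ("deltacommit".equals(action)) {
                count++;
            }
        }
        return count;
    }
}
```

A count obtained this way could be compared against a configured limit before `scheduleClustering` proceeds.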
   
   
   ### User Experience
   
   **How users will use this feature:**
   - Configuration changes needed
   - API changes
   - Usage examples
   
   
   ### Hudi RFC Requirements
   
   **RFC PR link:** (if applicable)
   
   **Why RFC is/isn't needed:**
   - Does this change public interfaces/APIs? (Yes/No)
   - Does this change storage format? (Yes/No)
   - Justification:
   

