kbuci opened a new issue, #17907:
URL: https://github.com/apache/hudi/issues/17907

   ### Feature Description
   
   **What the feature achieves:**
   
   Add new write config values that apply only when writers are configured 
to use `PreferWriterConflictResolutionStrategy` as the write conflict 
resolution strategy:
   - `wait_for_ingestion_inflight_attempts`: During write conflict resolution, 
if there are any non-table service writes that are only in a `REQUESTED` state 
with no workload profile, then fail with a write conflict exception.
   - `wait_for_ingestion_inflight_attempts`/`wait_for_ingestion_inflight_seconds`: 
After executing the clustering plan, and before committing the clustering, 
reload the active timeline to check whether any non-table service writes are in 
the `REQUESTED` state. While such instants exist, reload the timeline up to 
`wait_for_ingestion_inflight_attempts` times, waiting 
`wait_for_ingestion_inflight_seconds` seconds between attempts. Once all 
attempts are exhausted, proceed with committing the clustering. Note that this 
polling does not happen while the table lock is held.
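
   The polling described in the second bullet could look roughly like the sketch below. It is only an illustration under stated assumptions: `ClusteringCommitGate`, `waitForIngestionToClear`, and the `hasRequestedIngestion` callback are hypothetical names, not existing Hudi APIs. The callback stands in for "reload the active timeline and report whether any non-table service instant is still `REQUESTED`".

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration only; not part of Hudi. */
final class ClusteringCommitGate {
    private ClusteringCommitGate() {}

    /**
     * Polls until no ingestion instant is in REQUESTED state or attempts run out.
     * Must be called without holding the table lock.
     *
     * @param hasRequestedIngestion assumed callback: reloads the timeline and returns
     *                              true while a non-table service REQUESTED instant exists
     * @param attempts              value of wait_for_ingestion_inflight_attempts
     * @param waitSeconds           value of wait_for_ingestion_inflight_seconds
     * @return true if the timeline cleared, false if attempts were exhausted;
     *         the caller proceeds to commit the clustering in both cases
     */
    static boolean waitForIngestionToClear(BooleanSupplier hasRequestedIngestion,
                                           int attempts,
                                           long waitSeconds) {
        for (int attempt = 1; attempt <= attempts; attempt++) {
            if (!hasRequestedIngestion.getAsBoolean()) {
                return true;
            }
            if (attempt < attempts) {
                // Sleep only between attempts, not after the last one.
                try {
                    TimeUnit.SECONDS.sleep(waitSeconds);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for ingestion", e);
                }
            }
        }
        return false;
    }
}
```

   The return value is informational here, since the proposal commits the clustering either way; the conflict resolution step under the lock would still be what decides whether the clustering commit succeeds.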
   
   In addition, if `PreferWriterConflictResolutionStrategy` is set, then Hudi 
should forcibly override `hoodie.clustering.updates.strategy` to be 
`org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy`. 
Otherwise, non-table service writes will still self-abort when an inflight 
clustering instant is present and targets the same partition.
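
   A minimal sketch of that override, assuming the writer configuration is available as plain `java.util.Properties`. The class `ClusteringUpdateStrategyOverride` and its `apply` method are hypothetical, and the conflict strategy key `hoodie.write.lock.conflict.resolution.strategy` is an assumption about where the strategy class name is configured; only `hoodie.clustering.updates.strategy` and the `SparkAllowUpdateStrategy` class name come from the text above.

```java
import java.util.Properties;

/** Hypothetical helper for illustration only; not part of Hudi. */
final class ClusteringUpdateStrategyOverride {
    // Assumed key holding the conflict resolution strategy class name.
    static final String CONFLICT_STRATEGY_KEY = "hoodie.write.lock.conflict.resolution.strategy";
    static final String UPDATES_STRATEGY_KEY = "hoodie.clustering.updates.strategy";
    static final String ALLOW_UPDATE_STRATEGY =
        "org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy";

    private ClusteringUpdateStrategyOverride() {}

    /** Forces the allow-update strategy when the prefer-writer strategy is configured. */
    static Properties apply(Properties props) {
        String conflictStrategy = props.getProperty(CONFLICT_STRATEGY_KEY, "");
        if (conflictStrategy.endsWith("PreferWriterConflictResolutionStrategy")) {
            // Overrides any user-supplied value, as the proposal asks for.
            props.setProperty(UPDATES_STRATEGY_KEY, ALLOW_UPDATE_STRATEGY);
        }
        return props;
    }
}
```

   Matching on the class-name suffix keeps the sketch independent of the strategy's package; a real implementation would more likely compare against the class itself.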
   
   **Why this feature is needed:**
   We want to enforce that if an ingestion write 
(`insert/upsert/insert_overwrite/etc`) and a clustering write target the 
same partition, the former never fails due to a write conflict. We use 
`PreferWriterConflictResolutionStrategy` to achieve this, but the current 
implementation isn't sufficient when we attempt to cluster older 
partitions for datasets with upserts (or with small file handling enabled). In 
such scenarios, we want to ensure that the upsert write never fails, even if it 
means the clustering write repeatedly fails. We have implemented the above 
configurations and disable them for datasets that only do inserts (without 
small file handling). We can upstream our changes once we reach consensus.
   
   ### User Experience
   
   **How users will use this feature:**
   - Configuration changes needed
   - API changes
   - Usage examples
   
   
   ### Hudi RFC Requirements
   
   **RFC PR link:** (if applicable)
   
   **Why RFC is/isn't needed:**
   - Does this change public interfaces/APIs? (Yes/No)
   - Does this change storage format? (Yes/No)
   - Justification:
   

