kbuci opened a new issue, #17907: URL: https://github.com/apache/hudi/issues/17907
### Feature Description

**What the feature achieves:**

Add new write config values that apply only when writers are configured to use `PreferWriterConflictResolutionStrategy` as the write conflict resolution strategy:

- `wait_for_ingestion_inflight_attempts`: During write conflict resolution, if there are any non-table-service writes that are only in a `REQUESTED` state with no workload profile, fail with a write conflict exception.
- `wait_for_ingestion_inflight_attempts`/`wait_for_ingestion_inflight_seconds`: After executing the clustering plan, and before committing the clustering, reload the active timeline to check whether any non-table-service writes are in the `REQUESTED` state. While such instants exist, reload the timeline up to `wait_for_ingestion_inflight_attempts` times, waiting `wait_for_ingestion_inflight_seconds` seconds between attempts. Once all attempts are exhausted, proceed with committing the clustering. Note that this polling does not happen while the table lock is held.

In addition, if `PreferWriterConflictResolutionStrategy` is set, Hudi should forcibly override `hoodie.clustering.updates.strategy` to be `org.apache.hudi.client.clustering.update.strategy.SparkAllowUpdateStrategy`. Otherwise, non-table-service writes will still self-abort when an (inflight) clustering instant is present and targets the same partition.

**Why this feature is needed:**

We want to guarantee that if an ingestion write (`insert/upsert/insert_overwrite/etc`) and a clustering write target the same partition, the ingestion write never fails due to a write conflict. We use `PreferWriterConflictResolutionStrategy` to achieve this, but the current implementation is not sufficient when we cluster older partitions of datasets that receive upserts (or have small file handling enabled). In such scenarios, the upsert write must never fail, even if that means the clustering write fails repeatedly.
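The pre-commit wait described in the second bullet above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation we intend to upstream: `ClusteringCommitGate`, `Sleeper`, and `awaitNoRequestedIngestion` are hypothetical names rather than Hudi APIs, and the timeline reload is abstracted as a `Supplier` that returns the instant times of non-table-service writes still in `REQUESTED` state.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Supplier;

/**
 * Hypothetical sketch of the proposed wait before committing a clustering.
 * None of these names are Hudi APIs; they exist only for illustration.
 */
final class ClusteringCommitGate {

    /** Abstraction over sleeping so the loop can be exercised without real delays. */
    interface Sleeper {
        void sleep(Duration duration);
    }

    private final int attempts;      // stands in for wait_for_ingestion_inflight_attempts
    private final Duration interval; // stands in for wait_for_ingestion_inflight_seconds
    private final Sleeper sleeper;

    ClusteringCommitGate(int attempts, long seconds, Sleeper sleeper) {
        this.attempts = attempts;
        this.interval = Duration.ofSeconds(seconds);
        this.sleeper = sleeper;
    }

    /** A Sleeper backed by Thread.sleep that preserves the interrupt flag. */
    static Sleeper threadSleeper() {
        return duration -> {
            try {
                Thread.sleep(duration.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for ingestion", e);
            }
        };
    }

    /**
     * Reloads the timeline at most `attempts` times, sleeping between reloads.
     * Returns true as soon as no REQUESTED ingestion instants remain, or false
     * once all attempts are exhausted. Per the proposal, the caller proceeds to
     * commit the clustering in both cases.
     */
    boolean awaitNoRequestedIngestion(Supplier<List<String>> reloadRequestedIngestionInstants) {
        for (int attempt = 0; attempt < attempts; attempt++) {
            if (reloadRequestedIngestionInstants.get().isEmpty()) {
                return true;
            }
            // Sleep only between attempts, not after the last one.
            if (attempt < attempts - 1) {
                sleeper.sleep(interval);
            }
        }
        return false;
    }
}
```

In this sketch the caller would invoke `awaitNoRequestedIngestion` before acquiring the table lock, which matches the note that polling does not happen while the lock is held. The return value is informational only, since the clustering commit proceeds either way.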
We have implemented the above configurations and disable them for datasets that only do inserts (without small file handling). We can upstream our changes once we reach consensus.

### User Experience

**How users will use this feature:**

- Configuration changes needed
- API changes
- Usage examples

### Hudi RFC Requirements

**RFC PR link:** (if applicable)

**Why RFC is/isn't needed:**

- Does this change public interfaces/APIs? (Yes/No)
- Does this change storage format? (Yes/No)
- Justification:
