fhan688 opened a new pull request, #19040:
URL: https://github.com/apache/hudi/pull/19040
### Describe the issue this Pull Request addresses
Clustering execution can fail when a requested clustering plan references
data files that are no longer valid, for example corrupted parquet files or
files that operators explicitly need to remove from the pending plan.
This PR adds a Spark SQL procedure to repair a requested clustering plan
by pruning selected invalid file slices from the plan before clustering
execution.
### Summary and Changelog
This PR adds `repair_clustering_plan`, a Spark SQL procedure for repairing
pending requested clustering plans.
Changes:
- Add `repair_clustering_plan` procedure.
- Support dry-run mode by default, returning the files that would be
removed from the requested clustering plan.
- Support user-specified file removal through `op => 'delete'` and
`invalid_parquet_files`.
- Support parquet validation through `op => 'validate_delete'`, which
scans files referenced by the clustering plan and detects invalid parquet files.
- Support optional physical file deletion through `need_delete => true`.
- Rewrite the requested clustering/replace instant metadata with the
repaired clustering plan.
- Create a backup of the original requested instant metadata by default
before rewriting.
- Prevent removing all input groups by default unless `allow_empty_plan =>
true` is explicitly set.
- Register the procedure in `HoodieProcedures`.
- Add tests covering dry-run, plan repair, backup creation, clustering
execution after repair, invalid parquet validation, optional physical deletion,
and procedure registration.
No code was copied.
### Impact
This adds a new user-facing Spark SQL procedure:
```sql
call repair_clustering_plan(
  table => '<table>',
  instant => '<requested_clustering_instant>',
  op => 'delete',
  invalid_parquet_files => '<file1,file2>',
  dry_run => false
)
```
The change does not affect normal read/write/clustering behavior unless
the procedure is explicitly invoked.
The procedure can rewrite pending requested clustering metadata and can
delete physical files when `need_delete => true` is set. Dry-run is enabled by
default to avoid accidental mutation.
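As a rough sketch of the preview step, the same call can be issued without `dry_run`, relying on the default described above. The placeholders are the same as in the earlier example, and the shape of the returned rows is not specified here.

```sql
-- Preview only: dry_run is omitted, so the default (true) applies and
-- the requested clustering plan is not rewritten.
call repair_clustering_plan(
  table => '<table>',
  instant => '<requested_clustering_instant>',
  op => 'delete',
  invalid_parquet_files => '<file1,file2>'
)
```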
Validation mode scans files referenced by the requested clustering plan.
The scan parallelism is controlled by `validation_parallelism`.
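A validation-based repair might look like the following sketch. It assumes `validation_parallelism` is passed as a named argument like the other parameters; the value `8` is purely illustrative.

```sql
-- Scan the files referenced by the plan and prune the ones that fail
-- parquet validation. dry_run => false applies the repair.
call repair_clustering_plan(
  table => '<table>',
  instant => '<requested_clustering_instant>',
  op => 'validate_delete',
  validation_parallelism => 8,  -- assumed argument form, illustrative value
  dry_run => false
)
```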
### Risk Level
medium
The procedure rewrites the metadata of requested clustering instants and
optionally deletes physical files, so misuse could drop valid file slices from
a pending clustering plan or delete data files. The risk is mitigated by:
- `dry_run` defaulting to true.
- metadata backup enabled by default.
- explicit instant selection.
- refusing to remove all input groups unless `allow_empty_plan => true`.
- tests covering dry-run, actual repair, post-repair clustering execution,
invalid parquet detection, and physical deletion.
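For illustration, a call that opts out of these safeguards has to say so explicitly. This sketch assumes `need_delete` and `allow_empty_plan` can be combined with the arguments from the earlier example; it is not taken from the PR's tests.

```sql
-- Every destructive behavior is opted into explicitly:
--   dry_run => false          apply the repair instead of previewing it
--   need_delete => true       also delete the listed physical files
--   allow_empty_plan => true  permit a plan with no input groups left
call repair_clustering_plan(
  table => '<table>',
  instant => '<requested_clustering_instant>',
  op => 'delete',
  invalid_parquet_files => '<file1,file2>',
  dry_run => false,
  need_delete => true,
  allow_empty_plan => true
)
```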
### Documentation Update
Required.
This PR adds a new user-facing Spark SQL procedure. The Hudi website
procedure documentation should be updated to describe `repair_clustering_plan`,
including parameters, default values, dry-run behavior, backup behavior, and
examples for manual deletion and validation-based repair.
No new configs are added and no existing config defaults are changed.
### Contributor's checklist
- [x] Read through contributor's guide
(https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable