[PR] feat(spark): add repair_clustering_plan procedure for requested clustering plans [hudi]

via GitHub Thu, 18 Jun 2026 07:05:45 -0700


fhan688 opened a new pull request, #19040:
URL: https://github.com/apache/hudi/pull/19040


   ### Describe the issue this Pull Request addresses
   
     Clustering execution can fail when a requested clustering plan references 
data files that are no longer valid, for example corrupted parquet files or 
files that operators explicitly need to remove from the pending plan.
   
     This PR adds a Spark SQL procedure to repair a requested clustering plan 
by pruning selected invalid file slices from the plan before clustering 
execution.
   
   ### Summary and Changelog
   
     This PR adds `repair_clustering_plan`, a Spark SQL procedure for repairing 
pending requested clustering plans.
   
     Changes:
     - Add `repair_clustering_plan` procedure.
     - Support dry-run mode by default, returning the files that would be 
removed from the requested clustering plan.
     - Support user-specified file removal through `op => 'delete'` and 
`invalid_parquet_files`.
     - Support parquet validation through `op => 'validate_delete'`, which 
scans files referenced by the clustering plan and detects invalid parquet files.
     - Support optional physical file deletion through `need_delete => true`.
     - Rewrite the requested clustering/replace instant metadata with the 
repaired clustering plan.
     - Create a backup of the original requested instant metadata by default 
before rewriting.
     - Prevent removing all input groups by default unless `allow_empty_plan => 
true` is explicitly set.
     - Register the procedure in `HoodieProcedures`.
     - Add tests covering dry-run, plan repair, backup creation, clustering 
execution after repair, invalid parquet validation, optional physical deletion, 
and procedure registration.
   
     No code was copied.
   
   ### Impact
   
     This adds a new user-facing Spark SQL procedure:
   
     ```sql
     call repair_clustering_plan(
       table => '<table>',
       instant => '<requested_clustering_instant>',
       op => 'delete',
       invalid_parquet_files => '<file1,file2>',
       dry_run => false
     )
     ```
   
     The change does not affect normal read/write/clustering behavior unless 
the procedure is explicitly invoked.
   
     The procedure can rewrite pending requested clustering metadata and can 
delete physical files when `need_delete => true` is set. Dry-run is enabled by 
default to avoid accidental mutation.
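   
     As a rough sketch of the two modes described above, the first call below 
omits `dry_run` and so should only report the files that would be pruned, while 
the second applies the repair and also deletes the listed files. Only parameter 
names from this PR description are used; the angle-bracket values are 
placeholders, not real identifiers.
   
     ```sql
     -- Preview: dry_run is left at its default (true), so no metadata is rewritten.
     call repair_clustering_plan(
       table => '<table>',
       instant => '<requested_clustering_instant>',
       op => 'delete',
       invalid_parquet_files => '<file1,file2>'
     );

     -- Apply: rewrite the requested plan and physically delete the listed files.
     call repair_clustering_plan(
       table => '<table>',
       instant => '<requested_clustering_instant>',
       op => 'delete',
       invalid_parquet_files => '<file1,file2>',
       dry_run => false,
       need_delete => true
     );
     ```
   
     Running the preview first and comparing its output with the intended file 
list is the cheaper way to catch a wrong instant or file name before any 
mutation happens.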
   
     Validation mode scans files referenced by the requested clustering plan. 
The scan parallelism is controlled by `validation_parallelism`.
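   
     A validation-based repair might look like the sketch below. It assumes 
that `validate_delete` needs no explicit file list, which the description 
implies but does not state, and the parallelism value of 8 is an arbitrary 
illustration rather than a recommended setting.
   
     ```sql
     -- Scan the files referenced by the plan and preview the invalid ones
     -- (dry_run defaults to true, so the plan is not rewritten here).
     call repair_clustering_plan(
       table => '<table>',
       instant => '<requested_clustering_instant>',
       op => 'validate_delete',
       validation_parallelism => 8  -- illustrative value only
     );
     ```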
   
   
   ### Risk Level
   
     medium
   
     The procedure performs metadata repair on requested clustering instants 
and optionally deletes physical files, so misuse could remove entries from a 
pending clustering plan or delete data files. The risk is mitigated by:
   
     - `dry_run` defaulting to `true`.
     - metadata backup enabled by default.
     - explicit instant selection.
     - refusing to remove all input groups unless `allow_empty_plan => true`.
     - tests covering dry-run, actual repair, post-repair clustering execution, 
invalid parquet detection, and physical deletion.
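   
     To illustrate the empty-plan guard from the list above: if the listed 
files cover every input group in the plan, the repair is refused by default, 
and an operator would have to opt in explicitly, roughly as follows. The 
angle-bracket values are placeholders.
   
     ```sql
     -- Opt in to a repair that may leave the requested plan with no input groups.
     call repair_clustering_plan(
       table => '<table>',
       instant => '<requested_clustering_instant>',
       op => 'delete',
       invalid_parquet_files => '<file1,file2>',
       dry_run => false,
       allow_empty_plan => true
     );
     ```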
   
   ### Documentation Update
   
     Required.
   
     This PR adds a new user-facing Spark SQL procedure. The Hudi website 
procedure documentation should be updated to describe `repair_clustering_plan`, 
including parameters, default values, dry-run behavior, backup behavior, and 
examples for manual deletion and validation-based repair.
   
     No new configs are added and no existing config defaults are changed.
   
   ### Contributor's checklist
   
     - [x] Read through contributor's guide 
(https://hudi.apache.org/contribute/how-to-contribute)
     - [x] Enough context is provided in the sections above
     - [x] Adequate tests were added if applicable


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

