Re: [PR] fix(spark): preserve Spark's native unresolved-column errors in HoodieAnalysis [hudi]

via GitHub Sun, 21 Jun 2026 23:48:13 -0700


nsivabalan commented on PR #18147:
URL: https://github.com/apache/hudi/pull/18147#issuecomment-4765630557


   Hi @nada-attia, picking this up to address @yihua's review.
   
   The original approach — catch `UnresolvedException` in 
`ProducesHudiMetaFields.unapply` and rewrite it as a generic Hudi error — does 
close out the cryptic message, but as @yihua [pointed 
out](https://github.com/apache/hudi/pull/18147#discussion_r2795763747) it 
intercepts Spark before `CheckAnalysis` runs, so we lose Spark's structured 
`UNRESOLVED_COLUMN.WITH_SUGGESTION` errors with "did you mean" hints. Reworked 
to follow @yihua's suggestion in 
[#discussion_r2796386493](https://github.com/apache/hudi/pull/18147#discussion_r2796386493):
 let Spark's analyzer surface its own error.
   
   Force-pushed (squashed the 6 commits) — diff is now small (3 files, ~190 
lines, mostly the test).
   
   ### Code changes
   
   **`HoodieAnalysis.ProducesHudiMetaFields.unapply`** — gate the output 
inspection on `resolved.resolved`. If the plan is still not fully resolved, 
return `None` so the pattern-match falls through; Spark's `CheckAnalysis` then 
runs in its normal phase and produces the precise error.
   
   ```diff
   - if (resolved.output.exists(attr => isMetaField(attr.name))) {
   + // If the plan is still not fully resolved (e.g., references non-existent
   + // tables or columns), fall through. Spark's CheckAnalysis runs later and
   + // produces precise UNRESOLVED_COLUMN / TABLE_OR_VIEW_NOT_FOUND errors with
   + // "did you mean" suggestions; intercepting UnresolvedException here would
   + // discard that context. Only inspect the output once the plan is resolved.
   + if (resolved.resolved && resolved.output.exists(attr => 
isMetaField(attr.name))) {
   ```
   
   **`HoodieSparkBaseAnalysis.ResolveReferences` (MIT case)** — if 
`sourceTable` is still unresolved after `analyzer.execute`, return the 
partially-resolved `MergeIntoTable` plan instead of descending into 
`resolveAssignments`. Spark's downstream rules then handle the error correctly.
   
   ```diff
     val sourceTable = if (sourceTableO.resolved) sourceTableO else 
analyzer.execute(sourceTableO)
     val m = mO.asInstanceOf[MergeIntoTable].copy(targetTable = targetTable, 
sourceTable = sourceTable)
   - EliminateSubqueryAliases(targetTable) match {
   + if (!sourceTable.resolved) {
   +   m  // let Spark's CheckAnalysis surface the unresolved-source error
   + } else EliminateSubqueryAliases(targetTable) match {
   ```
   
   ### Reverted from the prior version
   
   - `failUnresolvedQuery` on the `HoodieCatalystPlansUtils` trait and the four 
version-specific impls (3.3 / 3.4 / 3.5 / 4.0)
   - `collectUnresolvedReferences` helper in `HoodieAnalysis`
   - The `try/catch UnresolvedException` block in 
`ProducesHudiMetaFields.unapply`
   - The unused `UnresolvedException` import
   
   ### Tests
   
   Rewrote `TestHoodieAnalysisErrorHandling`:
   
   - 3 scenarios: MERGE INTO with unresolved column in source query, MERGE INTO 
with unresolved column in ON predicate, INSERT INTO from non-existent source 
table.
   - Assertions now check that Spark's native error patterns 
(`UNRESOLVED_COLUMN` / `TABLE_OR_VIEW_NOT_FOUND` / "cannot be resolved") are 
present **and** that the specific column/table name appears in the message.
   - Each test also asserts that the prior Hudi wrapper (`"Failed to resolve 
query"` / `"Please check for: (1) typos"`) is NOT present, so a future 
regression that re-introduces the rewrap fails the test.
   - Helpers accept both Spark 3.3 (pre-error-class) and Spark 3.4+ 
(error-class) phrasings, addressing the version-difference concern in 
[#discussion_r2795763752](https://github.com/apache/hudi/pull/18147#discussion_r2795763752).
   
   ### Rebase
   
   Rebased onto latest master (was 424 commits behind).
   
   @yihua @danny0405 PTAL.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix(spark): preserve Spark's native unresolved-column errors in HoodieAnalysis [hudi]

Reply via email to