kbuci opened a new pull request, #18128:
URL: https://github.com/apache/hudi/pull/18128

   ### Describe the issue this Pull Request addresses
   
   When a pre-commit SQL validator (e.g. 
`SqlQuerySingleResultPreCommitValidator`) is configured and a write produces no 
new base files (an empty write via insert/upsert/bulkInsert with no data), 
validation fails with:
   
   `AnalysisException: A column or function parameter with name '_row_key' 
cannot be resolved`
   
   Cause: for empty pending commits, 
`SparkValidatorUtils.getRecordsFromPendingCommits` returns 
`sqlContext.emptyDataFrame()`, which has no columns. The validator then runs 
its SQL (e.g. `SELECT _row_key FROM <TABLE_NAME> ...`) against this schema-less 
DataFrame, and Spark fails because `_row_key` cannot be resolved.
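   The failure mode can be illustrated with a small stand-in sketch. The actual code is Java/Spark; here a DataFrame is modeled as just a column list plus rows, and `select_column` and `AnalysisError` are hypothetical names playing the role of Spark's analyzer and `AnalysisException`:

   ```python
   class AnalysisError(Exception):
       """Plays the role of Spark's AnalysisException."""

   def select_column(columns, rows, name):
       """Resolve `name` against the schema, as Spark's analyzer would."""
       if name not in columns:
           raise AnalysisError(
               f"A column or function parameter with name '{name}' "
               "cannot be resolved"
           )
       idx = columns.index(name)
       return [row[idx] for row in rows]

   # Pre-fix: sqlContext.emptyDataFrame() -> no columns, no rows.
   schemaless_empty = ([], [])

   # Post-fix: empty DataFrame that still carries the table schema.
   schemaful_empty = (["_hoodie_commit_time", "_row_key", "partition"], [])

   try:
       select_column(*schemaless_empty, "_row_key")
       resolved = True
   except AnalysisError:
       resolved = False  # the reported failure

   assert resolved is False
   # With the schema attached, the same query resolves and returns zero rows.
   assert select_column(*schemaful_empty, "_row_key") == []
   ```

   The point of the sketch: the query itself is fine; it is the schema-less empty DataFrame that makes any column reference unresolvable.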
   
   Fixes: https://github.com/apache/hudi/issues/17923
   
   ### Summary and Changelog
   
   **Summary:** For empty writes, pre-commit validators now receive an empty 
DataFrame that carries the table schema, so validators that reference columns 
(e.g. `_row_key`) no longer fail with an unresolved-column error.
   
   Note that without the fix in this PR, the new test fails with:
   
   ```
   Caused by: java.util.concurrent.CompletionException: org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column or function parameter with name `_row_key` cannot be resolved. ; line 1 pos 39;
         +- 'Aggregate ['_row_key], ['_row_key]
   Caused by: org.apache.spark.sql.AnalysisException: [UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column or function parameter with name `_row_key` cannot be resolved. ; line 1 pos 39;
         +- 'Aggregate ['_row_key], ['_row_key]
       at org.apache.spark.sql.errors.QueryCompilationErrors$.unresolvedAttributeError(QueryCompilationErrors.scala:306)
       at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
   ```
   
   **Changelog:**
   - **SparkValidatorUtils.getRecordsFromPendingCommits**: When 
`newFiles.isEmpty()`, return an empty DataFrame with the schema from 
`TableSchemaResolver.getTableSchema()` (via 
`HoodieSchemaConversionUtils.convertHoodieSchemaToStructType`) instead of 
`sqlContext.emptyDataFrame()`. On schema resolution failure, log a warning and 
fall back to a schema-less empty DataFrame.
   - **TestSparkValidatorUtils**: Added `testSqlQueryValidatorWithNoRecords()`, 
which (1) performs one commit with data so the table has a schema and data, 
(2) performs an empty insert with a SQL validator that references `_row_key` 
(`startCommitWithTime` + insert of an empty RDD, so the executor and validators 
run), and (3) asserts that the commit succeeds and the timeline has two 
commits. This ensures the empty-write + validator path is covered.
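   The changed branch can be sketched as follows. This is Python pseudocode for the Java change; `resolve_table_schema`, `empty_df`, and the logger name are illustrative assumptions, not Hudi API:

   ```python
   import logging

   logger = logging.getLogger("SparkValidatorUtils")

   def empty_df(columns):
       """Stand-in for building an empty DataFrame with the given schema."""
       return (list(columns), [])

   def get_records_for_empty_commit(resolve_table_schema):
       """Sketch of the patched no-new-files branch of
       getRecordsFromPendingCommits: prefer an empty DataFrame that carries
       the table schema; if schema resolution fails, log a warning and fall
       back to a schema-less empty DataFrame (the previous behavior)."""
       try:
           # In the real fix: TableSchemaResolver.getTableSchema(), converted
           # via HoodieSchemaConversionUtils.convertHoodieSchemaToStructType.
           schema = resolve_table_schema()
           return empty_df(schema)
       except Exception as e:
           logger.warning(
               "Could not resolve table schema for empty commit, falling "
               "back to schema-less empty DataFrame: %s", e)
           return empty_df([])

   # Schema resolves -> empty DataFrame that still has columns.
   cols, rows = get_records_for_empty_commit(lambda: ["_row_key", "ts"])
   assert cols == ["_row_key", "ts"] and rows == []

   # Resolution fails -> previous schema-less behavior is preserved.
   def failing_resolver():
       raise RuntimeError("no commit with a schema yet")

   cols, rows = get_records_for_empty_commit(failing_resolver)
   assert cols == [] and rows == []
   ```

   The fallback keeps the pre-fix behavior reachable, which is what keeps the risk of the change low.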
   
   
   ### Impact
   
   - **User-facing:** Pre-commit SQL validators that reference column names 
(e.g. `_row_key`) now succeed when the commit has no new files (empty write via 
insert/upsert/bulkInsert). No API or config changes.
   - **Performance:** Negligible; one schema lookup and one empty DataFrame 
creation when there are no new files and validators are configured.
   
   ### Risk Level
   
   **Low.** The change is limited to the “no new files” branch in 
`getRecordsFromPendingCommits` and reuses existing table schema resolution; 
the fallback preserves the previous behavior (a schema-less empty DataFrame). 
The new test confirms the executor-based empty-write path runs the validator 
and passes.
   
   ### Documentation Update
   
   None. No new configs, no change to public API or docs.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable
   