prashantwason commented on code in PR #18669:
URL: https://github.com/apache/hudi/pull/18669#discussion_r3203388257
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala:
##########
@@ -134,7 +134,19 @@ class DefaultSource extends RelationProvider
       parameters
     }
-    val relation = DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap)
+    // Spark's DataSource.resolveRelation() invokes this 3-arg overload directly via the
+    // SchemaRelationProvider path when a user-supplied schema is present (e.g.
+    // spark.read.schema(...).load(path)). The 2-arg overload catches
+    // HoodieSchemaNotFoundException and returns an EmptyRelation, but that catch is bypassed
+    // on this path, so we mirror the same handling here. Preserve the caller-supplied schema
+    // so subsequent query analysis (e.g. column resolution in WHERE clauses) sees the
+    // HMS-known columns even though the on-disk table is schemaless.
+    val relation = try {
+      DefaultSource.createRelation(sqlContext, metaClient, schema, options.toMap)
+    } catch {
+      case _: HoodieSchemaNotFoundException =>
+        new EmptyRelation(sqlContext, Option(schema).getOrElse(new StructType()))
Review Comment:
Update: had to revert the simplification in 5e25a570dd0f. It turns out the
2-arg createRelation overload (line 78) re-enters this 3-arg method with
schema=null, so the SchemaRelationProvider non-null contract assumption doesn't
hold for internal callers. The defensive Option(schema).getOrElse(new
StructType()) was load-bearing: removing it broke
TestCOWDataSource.testReadOfAnEmptyTable on spark3.3 / spark3.5 with an NPE in
BaseRelation.schema().isEmpty. The code comment now documents the
internal-recursion reason.
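
For reference, a minimal standalone sketch of the re-entry described above. The object, method, and exception names here are simplified stand-ins for illustration, not the actual Hudi signatures:

```scala
import org.apache.spark.sql.types.StructType

// Simplified stand-in for the two createRelation overloads discussed above;
// names are illustrative, not the real Hudi source.
object OverloadRecursionSketch {

  class SchemaNotFoundException extends RuntimeException

  // 2-arg path (RelationProvider): re-enters the 3-arg overload with schema = null,
  // so the SchemaRelationProvider "schema is non-null" contract does not hold
  // for this internal caller.
  def createRelation(path: String, params: Map[String, String]): StructType =
    createRelation(path, null, params)

  // 3-arg path (SchemaRelationProvider): normally receives a caller-supplied
  // schema from spark.read.schema(...).load(path), but can also see null when
  // called from the 2-arg overload above.
  def createRelation(path: String, schema: StructType, params: Map[String, String]): StructType =
    try {
      resolveTableSchema(path) // may throw for a schemaless (e.g. empty) table
    } catch {
      case _: SchemaNotFoundException =>
        // Load-bearing default: with schema == null (internal 2-arg caller),
        // returning `schema` unguarded would later NPE when the relation's
        // schema is inspected (e.g. BaseRelation.schema().isEmpty).
        Option(schema).getOrElse(new StructType())
    }

  // Hypothetical helper standing in for Hudi's real schema resolution.
  private def resolveTableSchema(path: String): StructType =
    throw new SchemaNotFoundException
}
```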
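
And a hedged example of the user-facing read the diff comment refers to. The table path and column names are made up, and `spark` is assumed to be an active SparkSession with the Hudi bundle on the classpath:

```scala
import org.apache.spark.sql.types.StructType

// Supplying a schema routes the source through the SchemaRelationProvider
// (3-arg) path; if the table on disk has no schema yet, the patched code
// returns an EmptyRelation that still carries these columns, so column
// resolution in later filters keeps working.
val userSchema = new StructType()
  .add("id", "string")
  .add("ts", "long")

val df = spark.read
  .format("hudi")
  .schema(userSchema)
  .load("/tmp/hudi/empty_table") // hypothetical path to a schemaless table

df.filter("id = 'abc'").show()   // resolves `id` against the supplied schema
```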