Pramod Biligiri created HUDI-5688:
-------------------------------------
Summary: schema field of EmptyRelation subtype of BaseRelation
should not be null
Key: HUDI-5688
URL: https://issues.apache.org/jira/browse/HUDI-5688
Project: Apache Hudi
Issue Type: Bug
Components: core
Reporter: Pramod Biligiri
Attachments: 1-userSpecifiedSchema-is-null.png, 2-empty-relation.png,
3-table-schema-will-not-resolve.png, 4-resolve-schema-returns-null.png,
Main.java, pom.xml
If the table has no completed instants and no user-specified schema (as
represented by the userSpecifiedSchema field in DataSource.scala), then
DefaultSource.createRelation returns an EmptyRelation whose schema is null.
This breaks the contract of Spark's BaseRelation, where the schema is a
StructType that is not expected to be null.
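A minimal sketch of a relation that honors that contract, assuming Spark 3.2 is
on the classpath (for example in spark-shell). The class name
SchemaSafeEmptyRelation and its userSchema parameter are hypothetical and chosen
only for illustration; this is not Hudi's actual EmptyRelation, nor necessarily
the fix that should go upstream. It falls back to an empty StructType whenever
no schema is available:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

// Hypothetical stand-in for an "empty" relation; not Hudi's real class.
class SchemaSafeEmptyRelation(val sqlContext: SQLContext,
                              userSchema: Option[StructType])
  extends BaseRelation with TableScan {

  // Option(_) maps a null wrapped in Some to None, so neither a missing
  // schema nor Some(null) can leak out as a null StructType.
  override def schema: StructType =
    userSchema.flatMap(Option(_)).getOrElse(new StructType())

  // An empty relation has no rows to scan.
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
}
```

With a schema that is never null, LogicalRelation.apply would have a StructType
to work with instead of hitting the NullPointerException shown in the trace.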
Module versions: current apache-hudi master (commit hash
abe26d4169c04da05b99941161621876e3569e96) built with spark3.2 and scala-2.12.
The following Spark session reproduces the issue:
spark.read.format("hudi")
.option("hoodie.datasource.query.type", "incremental")
.load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
java.lang.NullPointerException
at
org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
at
org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
at
org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
at
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
at
org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
... 50 elided
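One possible caller-side workaround, sketched under the assumption that a
schema passed through DataFrameReader.schema reaches the EmptyRelation (the
description above suggests this, but it has not been verified here). The field
names and types below are placeholders, not the real table schema, and `spark`
is the session provided by spark-shell:

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Placeholder schema (hypothetical columns); replace with the table's real ones.
val placeholderSchema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("payload", StringType, nullable = true)))

// Same read as in the reproduction, but with an explicit user-specified schema.
val df = spark.read.format("hudi")
  .schema(placeholderSchema)
  .option("hoodie.datasource.query.type", "incremental")
  .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
```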
The attached screenshots show the code flow and the buggy state of the
variables. The attached Java file and pom.xml can be used to reproduce the
issue (sorry, I don't have a deanonymized table to share yet).
The bug seems to have been introduced by this change:
[https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220]
Initial work on that file was done under this Jira
(https://issues.apache.org/jira/browse/HUDI-4363) and this PR
(https://github.com/apache/hudi/pull/6046).