Pramod Biligiri created HUDI-5688:
-------------------------------------

             Summary: schema field of EmptyRelation subtype of BaseRelation 
should not be null
                 Key: HUDI-5688
                 URL: https://issues.apache.org/jira/browse/HUDI-5688
             Project: Apache Hudi
          Issue Type: Bug
          Components: core
            Reporter: Pramod Biligiri
         Attachments: 1-userSpecifiedSchema-is-null.png, 2-empty-relation.png, 
3-table-schema-will-not-resolve.png, 4-resolve-schema-returns-null.png, 
Main.java, pom.xml

If the table has no completed instants and there is no user-specified schema 
for it either (as represented by the userSpecifiedSchema field in 
DataSource.scala), then the EmptyRelation returned by 
DefaultSource.createRelation has its schema set to null. This breaks the 
contract of Spark's BaseRelation, whose schema is a StructType that is not 
expected to be null.
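
To illustrate the contract, here is a minimal sketch of an empty relation whose 
schema is never null. It is not the actual Hudi EmptyRelation and not 
necessarily the fix that should be applied: the class name 
NullSafeEmptyRelation and the resolvedSchema parameter are hypothetical, and 
falling back to an empty StructType is only one possible choice.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

// Hypothetical stand-in for Hudi's EmptyRelation (assumption, not Hudi code).
// resolvedSchema is None when neither the table nor the user supplies a schema.
class NullSafeEmptyRelation(
    override val sqlContext: SQLContext,
    resolvedSchema: Option[StructType])
  extends BaseRelation with TableScan {

  // Fall back to an empty StructType so that callers such as
  // LogicalRelation.apply never receive a null schema.
  override def schema: StructType = resolvedSchema.getOrElse(StructType(Nil))

  // An empty relation has no rows to scan.
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
}
```

With a relation like this, spark.baseRelationToDataFrame(new 
NullSafeEmptyRelation(spark.sqlContext, None)) would be handed an empty 
StructType rather than null.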

Module versions: current apache-hudi master (commit hash 
abe26d4169c04da05b99941161621876e3569e96) built with spark3.2 and scala-2.12.

The following Spark session reproduces the issue:

spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")

java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
  at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
  at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
  ... 50 elided

Attached are a few screenshots that show the code flow and the buggy state of 
the variables. Also attached are a Java file and a pom.xml that can be used to 
reproduce the issue (sorry, I don't have an anonymized table to share yet).

The bug seems to have been introduced by this change in PR 6727: 
https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220

The initial work on that file was done under Jira HUDI-4363 
(https://issues.apache.org/jira/browse/HUDI-4363) and the corresponding PR 
(https://github.com/apache/hudi/pull/6046).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
