[
https://issues.apache.org/jira/browse/HUDI-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-5688:
---------------------------------
Labels: pull-request-available (was: )
> schema field of EmptyRelation subtype of BaseRelation should not be null
> ------------------------------------------------------------------------
>
> Key: HUDI-5688
> URL: https://issues.apache.org/jira/browse/HUDI-5688
> Project: Apache Hudi
> Issue Type: Bug
> Components: core
> Reporter: Pramod Biligiri
> Priority: Major
> Labels: pull-request-available
> Attachments: 1-userSpecifiedSchema-is-null.png, 2-empty-relation.png,
> 3-table-schema-will-not-resolve.png, 4-resolve-schema-returns-null.png,
> Main.java, pom.xml
>
>
> If the table has no completed instants and no user-defined schema (as
> represented by the userSpecifiedSchema field in DataSource.scala), then
> DefaultSource.createRelation returns an EmptyRelation whose schema is null.
> This breaks the contract of Spark's BaseRelation, where the schema is a
> StructType and is expected to be non-null.
> Module versions: current apache-hudi master (commit hash
> abe26d4169c04da05b99941161621876e3569e96) built with spark3.2 and scala-2.12.
> The following Hudi session reproduces the issue:
> spark.read.format("hudi")
> .option("hoodie.datasource.query.type", "incremental")
> .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
> java.lang.NullPointerException
> at org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
> at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
> at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
> at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
> at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
> ... 50 elided
> The attached screenshots show the code flow and the buggy state of the
> variables. Also attached are a Java file and a pom.xml that can be used to
> reproduce the issue (sorry, I don't have an anonymized table to share yet).
> The bug seems to have been introduced by this PR change:
> [https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220]
> Initial work on that file was done under this Jira issue
> (https://issues.apache.org/jira/browse/HUDI-4363) and this PR
> (https://github.com/apache/hudi/pull/6046).
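
The non-null schema contract described in the report can be illustrated with a small, dependency-free sketch. This is not the Hudi patch and does not use Spark: `Schema`, `Relation`, `EmptyRelationSketch`, and `SchemaFallbackDemo` are hypothetical stand-ins for `StructType`, `BaseRelation`, `EmptyRelation`, and a downstream consumer such as `LogicalRelation`. It only shows the idea of falling back to an empty schema instead of null when no user-specified schema is available.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for Spark's StructType: an ordered list of field names.
final class Schema {
    final List<String> fieldNames;

    Schema(List<String> fieldNames) {
        this.fieldNames = Collections.unmodifiableList(new ArrayList<>(fieldNames));
    }

    static Schema empty() {
        return new Schema(Collections.emptyList());
    }
}

// Hypothetical stand-in for Spark's BaseRelation: schema() must never return null.
abstract class Relation {
    abstract Schema schema();
}

// Hypothetical stand-in for EmptyRelation: it has no rows, but still has a schema.
final class EmptyRelationSketch extends Relation {
    private final Schema schema;

    EmptyRelationSketch(Schema userSpecifiedSchema) {
        // Assumption for this sketch: with no user-specified schema and nothing
        // to resolve from the table, use an empty schema rather than null.
        this.schema = (userSpecifiedSchema != null) ? userSpecifiedSchema : Schema.empty();
    }

    @Override
    Schema schema() {
        return schema;
    }
}

final class SchemaFallbackDemo {
    // Models a consumer that dereferences the schema without a null check,
    // which is where the NullPointerException in the report surfaces.
    static int countFields(Relation relation) {
        return relation.schema().fieldNames.size();
    }

    public static void main(String[] args) {
        Relation relation = new EmptyRelationSketch(null);
        System.out.println(countFields(relation)); // prints 0
    }
}
```

In Spark itself, `new StructType()` yields an empty schema, so an analogous fallback is possible there; whether an empty schema is the right behavior for Hudi in this case is a design decision for the actual fix.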
--
This message was sent by Atlassian Jira
(v8.20.10#820010)