[
https://issues.apache.org/jira/browse/HUDI-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688209#comment-17688209
]
sivabalan narayanan commented on HUDI-5688:
-------------------------------------------
One of the Hudi engineers will take over. Thanks for reporting; it looks like a
valid bug.
I could reproduce it with the latest master when the table does not have any
completed commits.
{code:java}
scala> val tripsSnapshotDF = spark.
| read.
| format("hudi").
| load(basePath)
23/02/13 16:21:54 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
23/02/13 16:21:54 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
23/02/13 16:21:54 INFO DataSourceUtils: Getting table path..
23/02/13 16:21:54 INFO TablePathUtils: Getting table path from path : file:/tmp/hudi_trips_cow
23/02/13 16:21:54 INFO DefaultSource: Obtained hudi table path: file:/tmp/hudi_trips_cow
23/02/13 16:21:54 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from file:/tmp/hudi_trips_cow
23/02/13 16:21:54 INFO HoodieTableConfig: Loading table properties from file:/tmp/hudi_trips_cow/.hoodie/hoodie.properties
23/02/13 16:21:54 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from file:/tmp/hudi_trips_cow
23/02/13 16:21:54 INFO DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
23/02/13 16:21:54 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20230213161934447__commit__INFLIGHT]}
java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:197)
... 64 elided
{code}
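For whoever picks this up: the trace in the issue description below shows
LogicalRelation.apply handing the relation's schema to CharVarcharUtils, so a
null schema fails there. Here is a minimal sketch of the direction a fix could
take, assuming an empty StructType is an acceptable fallback when nothing can
be resolved. SafeEmptyRelation and its userSpecifiedSchema parameter are
hypothetical names used for illustration; this is not the actual Hudi class
or the eventual patch.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation
import org.apache.spark.sql.types.StructType

// Hypothetical stand-in for Hudi's EmptyRelation, NOT the actual Hudi class.
// It only illustrates the BaseRelation contract: schema must never be null.
class SafeEmptyRelation(
    override val sqlContext: SQLContext,
    userSpecifiedSchema: Option[StructType]) extends BaseRelation {

  // Use the user-specified schema when present; otherwise fall back to an
  // empty StructType (an assumption of this sketch) instead of null.
  override def schema: StructType =
    userSpecifiedSchema.getOrElse(new StructType())
}
```

With no schema supplied, new SafeEmptyRelation(spark.sqlContext, None).schema
yields an empty StructType rather than null, so callers that read the schema
do not hit the NPE. Whether an empty schema is the right result for a table
with no completed commits is a separate design question.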
> schema field of EmptyRelation subtype of BaseRelation should not be null
> ------------------------------------------------------------------------
>
> Key: HUDI-5688
> URL: https://issues.apache.org/jira/browse/HUDI-5688
> Project: Apache Hudi
> Issue Type: Bug
> Components: core
> Reporter: Pramod Biligiri
> Priority: Major
> Labels: pull-request-available
> Attachments: 1-userSpecifiedSchema-is-null.png, 2-empty-relation.png,
> 3-table-schema-will-not-resolve.png, 4-resolve-schema-returns-null.png,
> Main.java, pom.xml
>
>
> If the table has no completed instants and no user-defined schema (as
> represented by the userSpecifiedSchema field in DataSource.scala), the
> EmptyRelation returned by DefaultSource.createRelation has its schema set to
> null. This breaks the contract of Spark's BaseRelation, where schema is a
> StructType that is not expected to be null.
> Module versions: current apache-hudi master (commit hash
> abe26d4169c04da05b99941161621876e3569e96) built with spark3.2 and scala-2.12.
> The following Hudi session reproduces the issue:
> spark.read.format("hudi")
> .option("hoodie.datasource.query.type", "incremental")
> .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
> java.lang.NullPointerException
> at org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
> at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
> at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
> at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
> at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
> ... 50 elided
> Attached are a few screenshots that show the code flow and the buggy state
> of the variables, along with a Java file and pom.xml that can be used to
> reproduce the issue (sorry, I don't have an anonymized table to share yet).
> The bug seems to have been introduced in this particular PR change:
> [https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220]
> Initial work on that file happened in Jira HUDI-4363
> (https://issues.apache.org/jira/browse/HUDI-4363) and in PR
> https://github.com/apache/hudi/pull/6046.