hudi-bot opened a new issue, #15750:
URL: https://github.com/apache/hudi/issues/15750
If a table has no completed instants and no user-defined schema (as
represented by the userSpecifiedSchema field in DataSource.scala), the
EmptyRelation returned by DefaultSource.createRelation has its schema set to
null. This breaks the contract of Spark's BaseRelation, whose schema is a
StructType that is not expected to be null.
Module versions: current apache-hudi master (commit hash
abe26d4169c04da05b99941161621876e3569e96) built with spark3.2 and scala-2.12.
The following Hudi session reproduces the issue:
spark.read.format("hudi")
.option("hoodie.datasource.query.type", "incremental")
.load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
java.lang.NullPointerException
  at org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
  at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
  at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
  at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
  ... 50 elided
Attached are a few screenshots that show the code flow and the buggy state
of the variables, along with a Java file and pom.xml that can be used to
reproduce the issue (sorry, I don't have a de-anonymized table to share yet).
The bug seems to have been introduced in this particular PR change:
[https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220]
Initial work on that file happened in this Jira
(https://issues.apache.org/jira/browse/HUDI-4363) and PR
(https://github.com/apache/hudi/pull/6046), respectively.
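
For illustration, the contract violation described at the top of this report can be modelled as a small null guard. This is only a sketch under stated assumptions: `Field`, `Schema`, `EmptyRelationSketch`, and `NullSchemaGuard` are hypothetical stand-ins written for this example, not Spark or Hudi classes, and falling back to an empty schema is one possible policy, not necessarily what any Hudi fix does. With Spark's real types, the analogous fallback would presumably be an empty `StructType`.

```scala
// Stand-in types for illustration only; these are NOT Spark or Hudi classes.
final case class Field(name: String, dataType: String)
final case class Schema(fields: Seq[Field])

// Models a relation whose schema must never be null (the contract the
// report says BaseRelation implementations are expected to honour).
final class EmptyRelationSketch(val schema: Schema) {
  require(schema != null, "relation schema must not be null")
}

object NullSchemaGuard {
  // Hypothetical fallback used when nothing could be resolved.
  val emptySchema: Schema = Schema(Seq.empty)

  // Option(null) is None, so a null input becomes the empty schema
  // and a non-null input is returned unchanged.
  def nonNullSchema(resolved: Schema): Schema =
    Option(resolved).getOrElse(emptySchema)
}
```

Wrapping the possibly-null value in `Option` keeps the null check in one place, so every caller that builds a relation receives a usable schema object.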
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-5688
- Type: Bug
- Fix version(s):
  - 1.1.0
- Attachment(s):
  - 02/Feb/23 12:29;pramodbiligiri;1-userSpecifiedSchema-is-null.png;https://issues.apache.org/jira/secure/attachment/13055052/1-userSpecifiedSchema-is-null.png
  - 02/Feb/23 12:29;pramodbiligiri;2-empty-relation.png;https://issues.apache.org/jira/secure/attachment/13055053/2-empty-relation.png
  - 02/Feb/23 12:29;pramodbiligiri;3-table-schema-will-not-resolve.png;https://issues.apache.org/jira/secure/attachment/13055054/3-table-schema-will-not-resolve.png
  - 02/Feb/23 12:29;pramodbiligiri;4-resolve-schema-returns-null.png;https://issues.apache.org/jira/secure/attachment/13055055/4-resolve-schema-returns-null.png
  - 02/Feb/23 12:41;pramodbiligiri;Main.java;https://issues.apache.org/jira/secure/attachment/13055051/Main.java
  - 02/Feb/23 12:41;pramodbiligiri;pom.xml;https://issues.apache.org/jira/secure/attachment/13055050/pom.xml
---
## Comments
06/Feb/23 07:48;pramodbiligiri;A small workaround for the null value, which
shows that the bug diagnosis is valid:
[https://github.com/apache/hudi/pull/7864]
I'm not sure whether the above change can be considered a fix for the issue.
---
14/Feb/23 00:27;shivnarayan;One of the Hudi engineers will take over. Thanks
for reporting; this looks like a valid bug.
I could reproduce it with the latest master when the table does not have any
completed commits.
{code:java}
scala> val tripsSnapshotDF = spark.
| read.
| format("hudi").
| load(basePath)
23/02/13 16:21:54 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
23/02/13 16:21:54 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
23/02/13 16:21:54 INFO DataSourceUtils: Getting table path..
23/02/13 16:21:54 INFO TablePathUtils: Getting table path from path : file:/tmp/hudi_trips_cow
23/02/13 16:21:54 INFO DefaultSource: Obtained hudi table path: file:/tmp/hudi_trips_cow
23/02/13 16:21:54 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from file:/tmp/hudi_trips_cow
23/02/13 16:21:54 INFO HoodieTableConfig: Loading table properties from file:/tmp/hudi_trips_cow/.hoodie/hoodie.properties
23/02/13 16:21:54 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from file:/tmp/hudi_trips_cow
23/02/13 16:21:54 INFO DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
23/02/13 16:21:54 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20230213161934447__commit__INFLIGHT]}
java.lang.NullPointerException
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:197)
  ... 64 elided
{code}