hudi-bot opened a new issue, #15750:
URL: https://github.com/apache/hudi/issues/15750

   If the table has no completed instants and no user-defined schema (the 
userSpecifiedSchema field in DataSource.scala), the EmptyRelation returned by 
DefaultSource.createRelation has its schema set to null. This breaks the 
contract of Spark's BaseRelation, whose schema is a StructType that is not 
expected to be null.
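
   The failure mode described above can be illustrated without Spark or Hudi on the classpath. The following is a minimal sketch under stated assumptions: `Schema`, `Relation`, `buggyEmptyRelation`, `guardedEmptyRelation`, and `fieldCount` are hypothetical stand-ins invented for illustration, not Spark or Hudi classes, and the empty-schema fallback is only one possible guard, not necessarily the fix adopted upstream.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

public class NullSchemaSketch {

    /** Hypothetical stand-in for Spark's StructType: an ordered list of field names. */
    public static final class Schema {
        final List<String> fieldNames;
        Schema(List<String> fieldNames) { this.fieldNames = fieldNames; }
        static Schema empty() { return new Schema(Collections.emptyList()); }
    }

    /** Hypothetical stand-in for a relation whose schema() callers assume is non-null. */
    public interface Relation {
        Schema schema();
    }

    /** Mirrors the reported behavior: with no resolvable schema, schema() returns null. */
    public static Relation buggyEmptyRelation(Optional<Schema> userSchema) {
        return () -> userSchema.orElse(null);
    }

    /** One possible guard (an assumption): fall back to an empty schema instead of null. */
    public static Relation guardedEmptyRelation(Optional<Schema> userSchema) {
        return () -> userSchema.orElseGet(Schema::empty);
    }

    /** Stand-in for a consumer that dereferences the schema without a null check. */
    public static int fieldCount(Relation relation) {
        return relation.schema().fieldNames.size();
    }

    public static void main(String[] args) {
        try {
            fieldCount(buggyEmptyRelation(Optional.empty()));
        } catch (NullPointerException e) {
            System.out.println("buggy relation: NullPointerException");
        }
        System.out.println("guarded relation fields: "
                + fieldCount(guardedEmptyRelation(Optional.empty())));
    }
}
```

   In this sketch the consumer fails only because it trusts the non-null contract, so the guard belongs on the producer side, where the relation is constructed.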
   
   Module versions: current apache-hudi master (commit hash 
abe26d4169c04da05b99941161621876e3569e96) built with spark3.2 and scala-2.12.
   
   The following Hudi session reproduces the issue:
   
   spark.read.format("hudi")
       .option("hoodie.datasource.query.type", "incremental")
       .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
   
   java.lang.NullPointerException
     at org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
     at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
     at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
     at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
     at scala.Option.getOrElse(Option.scala:189)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
     ... 50 elided
   
   Attached are a few screenshots showing the code flow and the buggy state of 
the variables, along with a Java file and pom.xml that can be used to 
reproduce the issue (sorry, I don't have a deanonymized table to share yet).
   
   The bug appears to have been introduced by this PR change: 
https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220
   
   Initial work on that file was done under this Jira 
(https://issues.apache.org/jira/browse/HUDI-4363) and this PR 
(https://github.com/apache/hudi/pull/6046).
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-5688
   - Type: Bug
   - Fix version(s):
     - 1.1.0
   - Attachment(s):
     - 02/Feb/23 12:29;pramodbiligiri;1-userSpecifiedSchema-is-null.png;https://issues.apache.org/jira/secure/attachment/13055052/1-userSpecifiedSchema-is-null.png
     - 02/Feb/23 12:29;pramodbiligiri;2-empty-relation.png;https://issues.apache.org/jira/secure/attachment/13055053/2-empty-relation.png
     - 02/Feb/23 12:29;pramodbiligiri;3-table-schema-will-not-resolve.png;https://issues.apache.org/jira/secure/attachment/13055054/3-table-schema-will-not-resolve.png
     - 02/Feb/23 12:29;pramodbiligiri;4-resolve-schema-returns-null.png;https://issues.apache.org/jira/secure/attachment/13055055/4-resolve-schema-returns-null.png
     - 02/Feb/23 12:41;pramodbiligiri;Main.java;https://issues.apache.org/jira/secure/attachment/13055051/Main.java
     - 02/Feb/23 12:41;pramodbiligiri;pom.xml;https://issues.apache.org/jira/secure/attachment/13055050/pom.xml
   
   
   ---
   
   
   ## Comments
   
   06/Feb/23 07:48;pramodbiligiri;A small workaround for the null value, which 
shows that the bug diagnosis is valid: 
https://github.com/apache/hudi/pull/7864
   
   I am not sure whether the above change can be considered a fix for the issue.
   
   ---
   
   14/Feb/23 00:27;shivnarayan;One of the Hudi engineers will take this over. 
Thanks for reporting. It looks like a valid bug. 
   
   I could reproduce it with the latest master when the table does not have any 
completed commits. 
   {code:java}
   scala> val tripsSnapshotDF = spark.
        |   read.
        |   format("hudi").
        |   load(basePath)
   23/02/13 16:21:54 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
   23/02/13 16:21:54 WARN DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
   23/02/13 16:21:54 INFO DataSourceUtils: Getting table path..
   23/02/13 16:21:54 INFO TablePathUtils: Getting table path from path : file:/tmp/hudi_trips_cow
   23/02/13 16:21:54 INFO DefaultSource: Obtained hudi table path: file:/tmp/hudi_trips_cow
   23/02/13 16:21:54 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from file:/tmp/hudi_trips_cow
   23/02/13 16:21:54 INFO HoodieTableConfig: Loading table properties from file:/tmp/hudi_trips_cow/.hoodie/hoodie.properties
   23/02/13 16:21:54 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from file:/tmp/hudi_trips_cow
   23/02/13 16:21:54 INFO DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
   23/02/13 16:21:54 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20230213161934447__commit__INFLIGHT]}
   java.lang.NullPointerException
     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:197)
     ... 64 elided
   {code}

