[jira] [Commented] (HUDI-5688) schema field of EmptyRelation subtype of BaseRelation should not be null

sivabalan narayanan (Jira) Mon, 13 Feb 2023 16:28:06 -0800


    [ 
https://issues.apache.org/jira/browse/HUDI-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688209#comment-17688209
 ]


sivabalan narayanan commented on HUDI-5688:
-------------------------------------------

One of the Hudi engineers will take over. Thanks for reporting; this looks like 
a valid bug. 

I could reproduce it with the latest master when the table does not have any 
completed commits. 
{code:java}
scala> val tripsSnapshotDF = spark.
     |   read.
     |   format("hudi").
     |   load(basePath)
23/02/13 16:21:54 WARN DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, 
please set it as the dir of hudi-defaults.conf
23/02/13 16:21:54 WARN DFSPropertiesConfiguration: Properties file 
file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
23/02/13 16:21:54 INFO DataSourceUtils: Getting table path..
23/02/13 16:21:54 INFO TablePathUtils: Getting table path from path : 
file:/tmp/hudi_trips_cow
23/02/13 16:21:54 INFO DefaultSource: Obtained hudi table path: 
file:/tmp/hudi_trips_cow
23/02/13 16:21:54 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from file:/tmp/hudi_trips_cow
23/02/13 16:21:54 INFO HoodieTableConfig: Loading table properties from 
file:/tmp/hudi_trips_cow/.hoodie/hoodie.properties
23/02/13 16:21:54 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from file:/tmp/hudi_trips_cow
23/02/13 16:21:54 INFO DefaultSource: Is bootstrapped table => false, tableType 
is: COPY_ON_WRITE, queryType is: snapshot
23/02/13 16:21:54 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[==>20230213161934447__commit__INFLIGHT]}
java.lang.NullPointerException
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:417)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:197)
  ... 64 elided {code}

> schema field of EmptyRelation subtype of BaseRelation should not be null
> ------------------------------------------------------------------------
>
>                 Key: HUDI-5688
>                 URL: https://issues.apache.org/jira/browse/HUDI-5688
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: core
>            Reporter: Pramod Biligiri
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: 1-userSpecifiedSchema-is-null.png, 2-empty-relation.png, 
> 3-table-schema-will-not-resolve.png, 4-resolve-schema-returns-null.png, 
> Main.java, pom.xml
>
>
> If there are no completed instants in the table, and there is no user-defined 
> schema for it either (as represented by the userSpecifiedSchema field in 
> DataSource.scala), then the EmptyRelation returned by 
> DefaultSource.createRelation has its schema set to null. This 
> breaks the contract of Spark's BaseRelation, where the schema is a StructType 
> and is not expected to be null.
> Module versions: current apache-hudi master (commit hash 
> abe26d4169c04da05b99941161621876e3569e96) built with spark3.2 and scala-2.12.
> The following Hudi session reproduces the above issue:
> spark.read.format("hudi")
>             .option("hoodie.datasource.query.type", "incremental")
>             .load("SOME_HUDI_TABLE_WITH_NO_COMPLETED_INSTANTS_OR_USER_SPECIFIED_SCHEMA")
> java.lang.NullPointerException
>   at org.apache.spark.sql.catalyst.util.CharVarcharUtils$.replaceCharVarcharWithStringInSchema(CharVarcharUtils.scala:41)
>   at org.apache.spark.sql.execution.datasources.LogicalRelation$.apply(LogicalRelation.scala:76)
>   at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:440)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
>   ... 50 elided  
> Attached are a few screenshots which show the code flow and the buggy state 
> of the variables. Also attached are a Java file and pom.xml that can be used 
> to reproduce the same (sorry, I don't have a deanonymized table to share yet).
> The bug seems to have been introduced in this particular PR change: 
> [https://github.com/apache/hudi/pull/6727/files#diff-4cfd87bb9200170194a633746094de138c3a0e3976d351d0d911ee95651256acR220]
> Initial work on that file has happened in this particular Jira 
> (https://issues.apache.org/jira/browse/HUDI-4363) and PR 
> (https://github.com/apache/hudi/pull/6046) respectively.
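
For readers who want to see the failure pattern in isolation, here is a minimal 
sketch in plain Java that models the contract described in the issue above. It 
does not use Spark or Hudi: Schema, Relation, countFields and emptyRelation are 
hypothetical stand-ins for StructType, BaseRelation, the Spark caller that 
dereferences the schema, and the code that builds the EmptyRelation. Falling 
back to an empty schema is only one possible way to honour the contract and is 
not necessarily the fix adopted in Hudi.

```java
import java.util.Collections;
import java.util.List;

public class EmptyRelationSketch {

    /** Hypothetical stand-in for Spark's StructType: an ordered list of field names. */
    static final class Schema {
        private final List<String> fieldNames;

        Schema(List<String> fieldNames) {
            this.fieldNames = fieldNames;
        }

        static Schema empty() {
            return new Schema(Collections.<String>emptyList());
        }

        List<String> fieldNames() {
            return fieldNames;
        }
    }

    /** Hypothetical stand-in for BaseRelation: callers assume schema() is never null. */
    interface Relation {
        Schema schema();
    }

    /** Models a caller that dereferences the schema without a null check. */
    static int countFields(Relation relation) {
        return relation.schema().fieldNames().size();
    }

    /**
     * Builds an "empty" relation. The resolved schema may be null, for example
     * when nothing has been committed and the user supplied no schema; in that
     * case an empty schema is substituted so the contract still holds.
     */
    static Relation emptyRelation(Schema resolvedOrNull) {
        final Schema safe = (resolvedOrNull != null) ? resolvedOrNull : Schema.empty();
        return () -> safe;
    }

    public static void main(String[] args) {
        // Buggy shape: the relation hands back a null schema.
        Relation buggy = () -> null;
        try {
            countFields(buggy);
        } catch (NullPointerException e) {
            System.out.println("NPE from null schema");
        }

        // Contract-preserving shape: an empty, non-null schema.
        Relation fixed = emptyRelation(null);
        System.out.println("fields in fixed empty relation: " + countFields(fixed));
    }
}
```

The point of the sketch is only that the null check belongs where the relation 
is constructed, because the caller has no reason to expect a null schema.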



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
