vicuna96 opened a new issue, #5582: URL: https://github.com/apache/hudi/issues/5582
**Describe the problem you faced**

Hi team, we are getting a `NullPointerException` when trying to use a merge statement to update columns of a table that is registered in Hive. We perform the initial load of the table using the Hive sync options, but we do not use these options on subsequent runs, because doing so fails with:

```
java.lang.NoSuchMethodError: org.apache.hadoop.hive.metastore.IMetaStoreClient.alter_table_with_environmentContext(Ljava/lang/String;Ljava/lang/String;Lorg/apache/hadoop/hive/metastore/api/Table;Lorg/apache/hadoop/hive/metastore/api/EnvironmentContext;)V
```

**To Reproduce**

Steps to reproduce the behavior:

1. Create the table with HMS Hive sync using the following syntax:

```
sfoSubDF.write.format("hudi").
  options(hudiOptions).
  option(TABLE_TYPE.key(), "COPY_ON_WRITE").
  option(OPERATION.key(), "bulk_insert").
  option(KEYGENERATOR_CLASS_NAME.key(), "org.apache.hudi.keygen.ComplexKeyGenerator").
  option(PRECOMBINE_FIELD.key(), "PROCESSING_TS").
  option(RECORDKEY_FIELD.key(), "KEY1,KEY2").
  option(PARTITIONPATH_FIELD.key(), "PARTITION_DT").
  option(HIVE_STYLE_PARTITIONING.key(), "true").
  option(HIVE_SYNC_MODE.key(), "hms").
  option(HIVE_DATABASE.key(), database).
  option(HIVE_TABLE.key(), tableName).
  option(HIVE_SYNC_ENABLED.key(), "true").
  option(TBL_NAME.key(), tableName).
  mode(Overwrite).
  save(toPath)
```

2. Attempt a partial update on top of the table, using the Spark SQL merge syntax:

```
merge into $HIVE_DB.$datasetName as target
using $sourceAliasOrder as source
on ${getDefaultMergeCondition()}
when matched and ${PTC.RECORD_TS} <> source.${PTC.RECORD_TS}
then update set ${PTC.RECORD_TS} = source.${PTC.RECORD_TS}
```

This immediately raises the `NullPointerException` from a call to the `parametersWithWriteDefaults` function, as detailed in the stack trace included.

**Expected behavior**

We expect the partial-update `merge into` statement to update the corresponding columns in the base table. Note that since we are not able to use Hive sync via HMS on Hudi (as described in https://github.com/apache/hudi/issues/4700), we would then run `msck repair` to update any necessary table metadata.

**Environment Description**

* Hudi version : 0.10.0
* Spark version : 2.4.7
* Hive version : 2.3.7
* Hadoop version : 2.10.1
* Storage (HDFS/S3/GCS..) : GCS
* Running on Docker? (yes/no) : No
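As additional context, the top frames of the stack trace (`java.util.Hashtable.put(Hashtable.java:460)` reached via `putAll` from `parametersWithWriteDefaults`) suggest that one of the merged write-option maps carries a null value: unlike `HashMap`, `java.util.Hashtable` (and its subclass `java.util.Properties`) rejects null keys and values. A minimal Java sketch of that mechanism follows; the option names are hypothetical, not the actual values Hudi passes internally:

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

public class HashtableNullDemo {

    // Returns true if copying the map into a Hashtable throws an NPE.
    // Hashtable.put rejects null values, which matches the failure at
    // Hashtable.java:460 in the stack trace.
    static boolean triggersNpe(Map<String, String> params) {
        try {
            new Hashtable<String, String>().putAll(params);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new HashMap<>();
        params.put("hoodie.table.name", "orderTableTesting");
        // Hypothetical: one merged option resolved to null.
        params.put("hoodie.datasource.write.precombine.field", null);
        System.out.println("NPE triggered: " + triggersNpe(params));
    }
}
```

If this is the mechanism, the fix would be to find which option resolves to null when the `merge into` path builds the write parameters.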
**Stacktrace**

```
22/05/14 00:50:12 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from gs://my_hudi_bucket/staging_zone/WorkflowPublish/orderTableTesting
22/05/14 00:50:12 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from gs://my_hudi_bucket/staging_zone/WorkflowPublish/orderTableTesting/.hoodie/hoodie.properties
22/05/14 00:50:12 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from gs://my_hudi_bucket/staging_zone/WorkflowPublish/orderTableTesting
22/05/14 00:50:12 INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline: Loaded instants upto : Option{val=[20220513213440105__commit__COMPLETED]}
22/05/14 00:50:12 WARN org.apache.hudi.common.config.DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
22/05/14 00:50:12 WARN org.apache.hudi.common.config.DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
22/05/14 00:50:12 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Loading HoodieTableMetaClient from gs://my_hudi_bucket/staging_zone/WorkflowPublish/orderTableTesting
22/05/14 00:50:12 INFO org.apache.hudi.common.table.HoodieTableConfig: Loading table properties from gs://my_hudi_bucket/staging_zone/WorkflowPublish/orderTableTesting/.hoodie/hoodie.properties
22/05/14 00:50:12 INFO org.apache.hudi.common.table.HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from gs://my_hudi_bucket/staging_zone/WorkflowPublish/orderTableTesting
22/05/14 00:50:12 WARN com.google.cloud.hadoop.fs.gcs.GoogleHadoopSyncableOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://opddev-dev-dpaas-phs-logs/history-server/spark-events/ghs-gif-streaming/application_1652454321869_0247_1.lz4.inprogress
22/05/14 00:50:12 ERROR com.walmart.archetype.core.WorkFlowManager: Exception while running Some(WorkflowPublish) Exception = null
22/05/14 00:50:12 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: User class threw exception: java.lang.NullPointerException
java.lang.NullPointerException
	at java.util.Hashtable.put(Hashtable.java:460)
	at java.util.Hashtable.putAll(Hashtable.java:524)
	at org.apache.hudi.HoodieWriterUtils$.parametersWithWriteDefaults(HoodieWriterUtils.scala:52)
	at org.apache.hudi.HoodieSparkSqlWriter$.mergeParamsAndGetHoodieConfig(HoodieSparkSqlWriter.scala:722)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:91)
	at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.executeUpsert(MergeIntoHoodieTableCommand.scala:285)
	at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:155)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
	at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:194)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3369)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:80)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
	at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:643)
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
