nitinkul commented on issue #5582:
URL: https://github.com/apache/hudi/issues/5582#issuecomment-1246655154
Any update on this issue?
I am facing exactly the same issue. The difference is that I wrote a bootstrap job to
do a ```bulk_insert``` using a Hudi Spark job, and then ran the incremental
run using dbt-spark with Hudi.
Output of the commands suggested by @minihippo:
```
CatalogTable(
Database: data_model
Table: click_fact
Owner: hive
Created Time: Tue Sep 13 22:17:07 UTC 2022
Last Access: UNKNOWN
Created By: Spark 2.2 or prior
Type: EXTERNAL
Provider: hudi
Table Properties: [bucketing_version=2,
last_commit_time_sync=20220914060931454]
Statistics: 823082628 bytes
Location: <masked>
Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hudi.hadoop.HoodieParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties: [hoodie.query.as.ro.table=false]
Schema: root
|-- _hoodie_commit_time: string (nullable = true)
|-- _hoodie_commit_seqno: string (nullable = true)
|-- _hoodie_record_key: string (nullable = true)
|-- _hoodie_partition_path: string (nullable = true)
|-- _hoodie_file_name: string (nullable = true)
|-- user_customer_id: string (nullable = true)
|-- Permissions_Min: string (nullable = true)
|-- Permissions_Max: string (nullable = true)
)
```
Hudi params used for ```bulk_insert```:
```
.option("hoodie.bulkinsert.shuffle.parallelism", appConfig.getNumPartitions())
.option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), "COPY_ON_WRITE")
.option(DataSourceWriteOptions.OPERATION_OPT_KEY(), DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL())
.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), appConfig.getRecordKey()) // "user_customer_id"
.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), appConfig.getPreCombineKey()) // "user_customer_id"
.option(HoodieWriteConfig.TABLE_NAME, appConfig.getTblName()) // "click_fact"
.option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY(), appConfig.getDbName()) // "data_model"
.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY(), appConfig.getTblName()) // "click_fact"
.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY(), "true")
.option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY(), "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY(), "org.apache.hudi.hive.NonPartitionedExtractor")
.option("hoodie.parquet.compression.codec", "snappy")
.format("org.apache.hudi")
.mode(SaveMode.Append)
.save(appConfig.getOutputPath());
```
Following is the stack trace:
```
22/09/14 11:25:25 ERROR SparkExecuteStatementOperation: Error executing query with c8765110-e83d-4133-8ebb-3590579213b4, currentState RUNNING,
java.lang.NullPointerException
	at java.util.Hashtable.put(Hashtable.java:460)
	at java.util.Hashtable.putAll(Hashtable.java:524)
	at org.apache.hudi.HoodieWriterUtils$.parametersWithWriteDefaults(HoodieWriterUtils.scala:52)
	at org.apache.hudi.HoodieSparkSqlWriter$.mergeParamsAndGetHoodieConfig(HoodieSparkSqlWriter.scala:722)
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:91)
	at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.executeUpsert(MergeIntoHoodieTableCommand.scala:285)
	at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:155)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
	at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:230)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3751)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
	at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3749)
	at org.apache.spark.sql.Dataset.<init>(Dataset.scala:230)
	at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:101)
```
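For what it's worth, the top of the trace (`Hashtable.put` inside `HoodieWriterUtils$.parametersWithWriteDefaults`) is consistent with a write option whose value is `null` being copied into a `Hashtable`-backed map, since `java.util.Hashtable` rejects null keys and values. The sketch below only reproduces that JDK-level mechanism; it is not Hudi code, and the option names in it are illustrative, not confirmed to be the null one in my setup:

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

// Minimal sketch of the failure mechanism suggested by the stack trace:
// copying a parameter map that contains a null value into a Hashtable
// (as Properties-style config handling does) throws NullPointerException.
public class NullValueDemo {
    static String copyParams(Map<String, String> params) {
        Hashtable<String, String> props = new Hashtable<>();
        try {
            props.putAll(params); // Hashtable.put throws on the null value
            return "copied " + props.size() + " params";
        } catch (NullPointerException e) {
            return "NullPointerException from Hashtable.put";
        }
    }

    public static void main(String[] args) {
        // HashMap tolerates null values, so the null survives until the copy.
        Map<String, String> params = new HashMap<>();
        params.put("hoodie.table.name", "click_fact");
        params.put("hoodie.datasource.write.precombine.field", null); // hypothetical null option
        System.out.println(copyParams(params)); // prints "NullPointerException from Hashtable.put"
    }
}
```

If that is the mechanism, it would point at some option resolving to null on the dbt-spark `MERGE INTO` path even though the same table writes fine from the bootstrap job.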
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]