nitinkul commented on issue #5582:
URL: https://github.com/apache/hudi/issues/5582#issuecomment-1246655154

   Any update on this issue?
   I am facing exactly the same issue. The difference is that I wrote a bootstrap job to do a `bulk_insert` via a hudi-spark job, and then ran incremental runs using dbt-spark with Hudi.

   Output of the commands suggested by @minihippo:
   
   ```
   CatalogTable(
   Database: data_model
   Table: click_fact
   Owner: hive
   Created Time: Tue Sep 13 22:17:07 UTC 2022
   Last Access: UNKNOWN
   Created By: Spark 2.2 or prior
   Type: EXTERNAL
   Provider: hudi
   Table Properties: [bucketing_version=2, 
last_commit_time_sync=20220914060931454]
   Statistics: 823082628 bytes
   Location: <masked>
   Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
   InputFormat: org.apache.hudi.hadoop.HoodieParquetInputFormat
   OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
   Storage Properties: [hoodie.query.as.ro.table=false]
   Schema: root
    |-- _hoodie_commit_time: string (nullable = true)
    |-- _hoodie_commit_seqno: string (nullable = true)
    |-- _hoodie_record_key: string (nullable = true)
    |-- _hoodie_partition_path: string (nullable = true)
    |-- _hoodie_file_name: string (nullable = true)
    |-- user_customer_id: string (nullable = true)
    |-- Permissions_Min: string (nullable = true)
    |-- Permissions_Max: string (nullable = true)
   )
   ```
   Hudi params used for `bulk_insert`:
   ```java
   .option("hoodie.bulkinsert.shuffle.parallelism", appConfig.getNumPartitions())
   .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), "COPY_ON_WRITE")
   .option(DataSourceWriteOptions.OPERATION_OPT_KEY(), DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL())
   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), appConfig.getRecordKey()) // "user_customer_id"
   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), appConfig.getPreCombineKey()) // "user_customer_id"
   .option(HoodieWriteConfig.TABLE_NAME, appConfig.getTblName()) // "click_fact"
   .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY(), appConfig.getDbName()) // "data_model"
   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY(), appConfig.getTblName()) // "click_fact"
   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY(), "true")
   .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY(), "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY(), "org.apache.hudi.hive.NonPartitionedExtractor")
   .option("hoodie.parquet.compression.codec", "snappy")
   .format("org.apache.hudi")
   .mode(SaveMode.Append)
   .save(appConfig.getOutputPath());
   ```
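   As an aside, any getter in a chain like the one above that returns null (for example, a `getPreCombineKey()` reading a missing config entry) can surface much later as an NPE once Hudi merges the options into its Hashtable-backed defaults. A small defensive sketch (`withoutNulls` is a hypothetical helper of mine, not part of the job above), assuming the options are first collected into a `Map` before being applied:

   ```java
   import java.util.HashMap;
   import java.util.Map;
   import java.util.stream.Collectors;

   public class OptionSanitizer {
       /** Drop null-valued options so they never reach Hashtable-backed config merging. */
       public static Map<String, String> withoutNulls(Map<String, String> opts) {
           return opts.entrySet().stream()
                   .filter(e -> e.getValue() != null)
                   .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
       }

       public static void main(String[] args) {
           Map<String, String> opts = new HashMap<>();
           opts.put("hoodie.table.name", "click_fact");
           opts.put("hoodie.datasource.write.precombine.field", null); // would NPE inside a Hashtable
           System.out.println(withoutNulls(opts)); // only the non-null entry survives
       }
   }
   ```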
   The following is the stack trace:
   ```
   22/09/14 11:25:25 ERROR SparkExecuteStatementOperation: Error executing query with c8765110-e83d-4133-8ebb-3590579213b4, currentState RUNNING,
   java.lang.NullPointerException
        at java.util.Hashtable.put(Hashtable.java:460)
        at java.util.Hashtable.putAll(Hashtable.java:524)
        at org.apache.hudi.HoodieWriterUtils$.parametersWithWriteDefaults(HoodieWriterUtils.scala:52)
        at org.apache.hudi.HoodieSparkSqlWriter$.mergeParamsAndGetHoodieConfig(HoodieSparkSqlWriter.scala:722)
        at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:91)
        at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.executeUpsert(MergeIntoHoodieTableCommand.scala:285)
        at org.apache.spark.sql.hudi.command.MergeIntoHoodieTableCommand.run(MergeIntoHoodieTableCommand.scala:155)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
        at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:230)
        at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3751)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
        at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
        at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
        at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3749)
        at org.apache.spark.sql.Dataset.<init>(Dataset.scala:230)
        at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:101)
   ```
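   For what it's worth, the top two frames explain the failure mode: `java.util.Hashtable` rejects null keys and null values, so an NPE inside `parametersWithWriteDefaults` suggests one of the merged write options carries a null value. A minimal JDK-only sketch of that behavior (not Hudi code, just reproducing what `Hashtable.putAll` does here):

   ```java
   import java.util.HashMap;
   import java.util.Hashtable;
   import java.util.Map;

   public class HashtableNullDemo {
       public static void main(String[] args) {
           // HashMap tolerates null values...
           Map<String, String> params = new HashMap<>();
           params.put("hoodie.datasource.write.operation", "upsert");
           params.put("some.option", null);

           // ...but Hashtable does not: put() throws NPE, and putAll() calls put().
           Hashtable<String, String> defaults = new Hashtable<>();
           try {
               defaults.putAll(params);
           } catch (NullPointerException e) {
               System.out.println("NPE from Hashtable.put, matching the top of the stack trace");
           }
       }
   }
   ```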
   

