CTTY opened a new issue, #11772:
URL: https://github.com/apache/hudi/issues/11772

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at [email protected].
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   `INSERT INTO` queries fail on tables created with the DataFrame API due to a config conflict. The exception below shows a conflict on the precombine field, but I believe this can happen with any datasource config.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Open a Scala shell (spark-shell) with the Hudi Spark bundle and create a table with the DataFrame API; sample script:
   ```
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.spark.sql.SaveMode

   val df1 = Seq(
    ("100", "2015-01-01", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
    ("101", "2015-01-01", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
    ("102", "2015-01-01", "event_name_345", "2015-01-01T13:51:40.417052Z", "type3"),
    ("103", "2015-01-01", "event_name_234", "2015-01-01T13:51:40.519832Z", "type4"),
    ("104", "2015-01-01", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
    ("105", "2015-01-01", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2"),
    ("106", "2015-01-01", "event_name_890", "2015-01-01T13:51:44.735360Z", "type3"),
    ("107", "2015-01-01", "event_name_944", "2015-01-01T13:51:45.019544Z", "type4"),
    ("108", "2015-01-01", "event_name_456", "2015-01-01T13:51:45.208007Z", "type1"),
    ("109", "2015-01-01", "event_name_567", "2015-01-01T13:51:45.369689Z", "type2"),
    ("110", "2015-01-01", "event_name_789", "2015-01-01T12:15:05.664947Z", "type3"),
    ("111", "2015-01-01", "event_name_322", "2015-01-01T13:51:47.388239Z", "type4")
   ).toDF("event_id", "event_date", "event_name", "event_ts", "event_type")

   val r = scala.util.Random
   val num = r.nextInt(99999)
   val tableName = "tableName" + num
   val tablePath = "table path" // replace with the table base path, e.g. an S3 URI

   df1.write.format("hudi")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.table.name", tableName)
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
    .option("hoodie.datasource.write.recordkey.field", "event_id,event_date")
    .option("hoodie.datasource.write.partitionpath.field", "event_type")
    .option("hoodie.datasource.write.precombine.field", "event_ts")
    .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
   //  .option("hoodie.datasource.write.hive_style_partitioning", "true")
    .option("hoodie.datasource.hive_sync.enable", "true")
    .option("hoodie.datasource.meta.sync.enable", "true")
    .option("hoodie.meta.sync.client.tool.class", "org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool")
    .option("hoodie.datasource.hive_sync.mode", "hms")
    .option("hoodie.datasource.hive_sync.database", "default")
    .option("hoodie.datasource.hive_sync.table", tableName)
    .option("hoodie.datasource.hive_sync.partition_fields", "event_type")
    .option("hoodie.datasource.hive_sync.partition_extractor_class", "org.apache.hudi.hive.MultiPartKeysValueExtractor")
    .mode(SaveMode.Append)
    .save(tablePath)
   ```
   2. Run `INSERT INTO` with spark-sql, substituting the name of the table created above:
   ```
   INSERT INTO table_name (event_id, event_date, event_name, event_ts, event_type)
   VALUES ('131', '2015-01-01', 'event_name_567', '2015-01-01T13:51:45.369689Z', 'type2');
   ```
   
   **Expected behavior**
   
   `INSERT INTO` should work on tables created with the DataFrame API.
   
   **Environment Description**
   EMR-7.2
   
   * Hudi version : 0.14.1 (Hudi 0.15 or Spark 3.4 should have the same problem)
   
   * Spark version : 3.5.0
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   This issue doesn't happen if the table was created with SQL.
   I suspect this is related to hive sync. I used Glue as the catalog, and I don't see the precombine config synced to Glue when the table is created with the DataFrame API. The precombine field then cannot be inferred correctly [here](https://github.com/apache/hudi/blob/9db0a60e677d6e38d2ecaba08cac112652a05bb8/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala#L536) because the catalog doesn't have the precombine info. If the table was created with SQL, the precombine field would be synced to Glue and inferred correctly when inserting data.
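   A possible workaround until this is fixed (untested sketch; assumes Hudi's spark-sql writer honors session-level write configs and the `preCombineField` table property, with `table_name` standing in for the table created above):
   ```
   -- Pass the precombine field explicitly for the spark-sql session:
   SET hoodie.datasource.write.precombine.field=event_ts;

   -- Or persist it on the catalog table so it can be inferred on INSERT:
   ALTER TABLE table_name SET TBLPROPERTIES ('preCombineField' = 'event_ts');
   ```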
   
   **Stacktrace**
   
   ```
   org.apache.hudi.exception.HoodieException: Config conflict(key       current value   existing value):
   PreCombineKey:               event_ts
        at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:212)
        at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:249)
        at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:204)
        at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:121)
        at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:108)
        at org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand.run(InsertIntoHoodieTableCommand.scala:61)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
        at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:126)
        at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:108)
   ```
   
   

