[GitHub] [hudi] whight opened a new issue, #8451: [SUPPORT] Insert write operation pre combined problem

via GitHub Thu, 13 Apr 2023 04:27:35 -0700


whight opened a new issue, #8451:
URL: https://github.com/apache/hudi/issues/8451


   **Describe the problem you faced**
   
   I use Spark Structured Streaming to import Kafka data into a Hudi table. The 
Kafka messages contain many records with the same id. The write operation is 
INSERT, which I understand to mean that precombine is not applied, but I found 
that many rows in the table were upserted: only a few of the rows with a 
duplicate key are kept in the table. Why?
   
   **Expected behavior**
   Every row with a duplicate key should be stored in the table.
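
   To make the expectation concrete, here is a minimal plain-Scala sketch of the 
two behaviours. It is only a model and does not call Hudi; the names `Rec`, 
`insertAll`, and `precombine` are hypothetical, and keeping the row with the 
largest `timestamp` is an assumption about how precombine would pick a winner.

   ```scala
   // Plain-Scala model of the two behaviours; this is not Hudi code.
   final case class Rec(id: String, timestamp: Long, day: String)

   object InsertVsPrecombine {
     // Expected INSERT behaviour: every incoming row is kept, duplicates included.
     def insertAll(batch: Seq[Rec]): Seq[Rec] = batch

     // Observed, upsert-like behaviour: one row per (day, id),
     // assumed here to be the row with the largest timestamp.
     def precombine(batch: Seq[Rec]): Seq[Rec] =
       batch
         .groupBy(r => (r.day, r.id))
         .values
         .map(_.maxBy(_.timestamp))
         .toSeq
         .sortBy(r => (r.day, r.id))

     // Sample batch with two rows sharing id "a".
     val batch: Seq[Rec] = Seq(
       Rec("a", 1L, "2023-04-13"),
       Rec("a", 2L, "2023-04-13"),
       Rec("b", 1L, "2023-04-13")
     )
   }
   ```

   With this sample batch, `insertAll` keeps all three rows, while `precombine` 
keeps one row for id `a` and one for id `b`. The first is what I expect from 
INSERT; the second resembles what I see in the table.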
   
   **Environment Description**
   
   * Hudi version : 0.10.0
   
   * Spark version : 3.2.1
   
   * Hive version : 3.1.2
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : Aliyun OSS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   This is the Spark writer:
   ```scala
   import org.apache.hudi.DataSourceWriteOptions

   dataSet.writeStream
     .format("org.apache.hudi")
     // Copy-on-write table written with the INSERT operation
     .option(DataSourceWriteOptions.TABLE_TYPE.key(), DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
     .option(DataSourceWriteOptions.OPERATION.key(), DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
     // Record key, precombine field, and partition path
     .option(DataSourceWriteOptions.RECORDKEY_FIELD.key(), "id")
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD.key(), "timestamp")
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "day")
     .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_NAME.key(), "org.apache.hudi.keygen.ComplexKeyGenerator")
     .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING.key(), "true")
     // Hive sync settings (the URL value is left empty in this report)
     .option(DataSourceWriteOptions.HIVE_URL.key(), "")
     .option(DataSourceWriteOptions.HIVE_DATABASE.key(), dbName)
     .option(DataSourceWriteOptions.HIVE_TABLE.key(), tableName)
     .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS.key(), "day")
     .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED.key(), "true")
     .option("hoodie.table.name", tableName)
     .option("hoodie.bulkinsert.shuffle.parallelism", "6")
     .option("hoodie.insert.shuffle.parallelism", "6")
     .option("hoodie.upsert.shuffle.parallelism", "6")
     // Table path and checkpoint location are left empty in this report
     .option("path", "")
     .option("checkpointLocation", "")
     .start()
   ```
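
   Two Hudi write configs may be relevant to the writer above. This is a hedged 
sketch, not a confirmed diagnosis: the assumption is that 
`hoodie.combine.before.insert` controls deduplication of the incoming batch and 
that `hoodie.merge.allow.duplicate.on.inserts` controls whether rows with the 
same key survive when inserts are merged into existing small files. Their 
defaults differ between releases, so they should be checked against the 0.10.0 
documentation. The name `duplicateHandlingOpts` is hypothetical.

   ```scala
   // Assumed-relevant Hudi write configs; verify the keys and defaults for your release.
   val duplicateHandlingOpts: Map[String, String] = Map(
     // Do not deduplicate the incoming batch before an INSERT
     "hoodie.combine.before.insert" -> "false",
     // Keep rows with the same key when inserts are merged into existing files
     "hoodie.merge.allow.duplicate.on.inserts" -> "true"
   )
   ```

   These could be passed to the stream writer with `.options(duplicateHandlingOpts)` 
before `.start()`.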
   
   


