[
https://issues.apache.org/jira/browse/HUDI-4765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611651#comment-17611651
]
Raymond Xu commented on HUDI-4765:
----------------------------------
Verified that record key generation is aligned:
{code:java}
+-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+---+----+-----+----+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                        |id |name|price|ts  |
+-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+---+----+-----+----+
|20220930232725894  |20220930232725894_0_0|1                 |                      |eca7b1ed-c876-4ccf-b0e6-e21fd06965ff-0_0-14-10_20220930233001187.parquet|1  |a1  |20.0 |1000|
|20220930233001187  |20220930233001187_0_1|2                 |                      |eca7b1ed-c876-4ccf-b0e6-e21fd06965ff-0_0-14-10_20220930233001187.parquet|2  |a2  |200.0|100 |
+-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+---+----+-----+----+
{code}
> Compared with inserting data via spark-shell, spark-sql's _hoodie_record_key
> generation logic is different, which might affect data upsert
> ----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-4765
> URL: https://issues.apache.org/jira/browse/HUDI-4765
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark, spark-sql
> Affects Versions: 0.11.1
> Environment: Spark 3.1.1
> Hudi 0.11.1
> Reporter: Yao Zhang
> Assignee: Raymond Xu
> Priority: Critical
> Fix For: 0.12.1
>
>
> Create table using spark-sql:
> {code:java}
> create table hudi_mor_tbl (
> id int,
> name string,
> price double,
> ts bigint
> ) using hudi
> tblproperties (
> type = 'mor',
> primaryKey = 'id',
> preCombineField = 'ts'
> )
> location 'hdfs:///hudi/hudi_mor_tbl'; {code}
> And then insert data via spark-shell and spark-sql respectively:
> {code:java}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val fields = Array(
> StructField("id", IntegerType, true),
> StructField("name", StringType, true),
> StructField("price", DoubleType, true),
> StructField("ts", LongType, true)
> )
> val simpleSchema = StructType(fields)
> val data = Seq(Row(2, "a2", 200.0, 100L))
> val df = spark.createDataFrame(data, simpleSchema)
> df.write.format("hudi").
> option(PRECOMBINE_FIELD_OPT_KEY, "ts").
> option(RECORDKEY_FIELD_OPT_KEY, "id").
> option(TABLE_NAME, "hudi_mor_tbl").
> option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
> mode(Append).
> save("hdfs:///hudi/hudi_mor_tbl") {code}
> {code:java}
> insert into hudi_mor_tbl select 1, 'a1', 20, 1000; {code}
> After that, querying the table shows the two rows below:
> {code:java}
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|price|  ts|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
> |  20220902012710792|20220902012710792...|                 2|                      |c3eff8c8-fa47-48c...|  2|  a2|200.0| 100|
> |  20220902012813658|20220902012813658...|              id:1|                      |c3eff8c8-fa47-48c...|  1|  a1| 20.0|1000|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
> {code}
> The '_hoodie_record_key' field for the spark-sql inserted row is 'id:1', while
> for the spark-shell inserted row it is '2'. It seems that spark-sql uses
> '[primaryKey_field_name]:[primaryKey_field_value]' to construct the
> '_hoodie_record_key' field, which is different from spark-shell.
> As a result, if we insert one row via spark-sql and then upsert it via
> spark-shell, we get two duplicated rows. That is not what we expect.
> Did I miss some configuration that might lead to this issue? If not, I
> personally think the default record key generation logic should be made
> consistent.
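To make the reported problem concrete, here is a minimal sketch in plain Python (not Hudi code; the two key formats are taken verbatim from the query output above) of why mismatched key-generation styles turn an upsert into a duplicate insert when records are matched on `_hoodie_record_key`:

```python
# Hypothetical model of the two observed key styles (not actual Hudi classes).
def simple_key(record, key_field="id"):
    # spark-shell style observed above: just the value, e.g. "1"
    return str(record[key_field])

def field_prefixed_key(record, key_field="id"):
    # spark-sql style observed above: "field:value", e.g. "id:1"
    return f"{key_field}:{record[key_field]}"

# Model the table as a dict keyed by _hoodie_record_key, so an upsert of an
# existing key overwrites the row and a new key adds a row.
table = {}

# Insert a row "via spark-sql" (field-prefixed key) ...
row = {"id": 1, "name": "a1", "price": 20.0, "ts": 1000}
table[field_prefixed_key(row)] = row            # key "id:1"

# ... then upsert the same logical record "via spark-shell" (simple key).
updated = {"id": 1, "name": "a1", "price": 20.0, "ts": 2000}
table[simple_key(updated)] = updated            # key "1" -- no match, new row

# The same logical record now exists twice instead of being updated once.
print(sorted(table))   # ['1', 'id:1']
print(len(table))      # 2
```

The sketch shows the failure mode, not the fix; the actual alignment of the two code paths is what HUDI-4765 tracks.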
--
This message was sent by Atlassian Jira
(v8.20.10#820010)