[
https://issues.apache.org/jira/browse/HUDI-4765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17611651#comment-17611651
]
Raymond Xu commented on HUDI-4765:
----------------------------------
Verified that record key generation is aligned:
{code:java}
+-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+---+----+-----+----+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                        |id |name|price|ts  |
+-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+---+----+-----+----+
|20220930232725894  |20220930232725894_0_0|1                 |                      |eca7b1ed-c876-4ccf-b0e6-e21fd06965ff-0_0-14-10_20220930233001187.parquet|1  |a1  |20.0 |1000|
|20220930233001187  |20220930233001187_0_1|2                 |                      |eca7b1ed-c876-4ccf-b0e6-e21fd06965ff-0_0-14-10_20220930233001187.parquet|2  |a2  |200.0|100 |
+-------------------+---------------------+------------------+----------------------+------------------------------------------------------------------------+---+----+-----+----+
{code}
> Compared with inserting data via spark-shell, spark-sql's _hoodie_record_key
> generation logic is different, which might affect data upsert
> ----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-4765
> URL: https://issues.apache.org/jira/browse/HUDI-4765
> Project: Apache Hudi
> Issue Type: Bug
> Components: spark, spark-sql
> Affects Versions: 0.11.1
> Environment: Spark 3.1.1
> Hudi 0.11.1
> Reporter: Yao Zhang
> Assignee: Raymond Xu
> Priority: Critical
> Fix For: 0.12.1
>
>
> Create table using spark-sql:
> {code:java}
> create table hudi_mor_tbl (
> id int,
> name string,
> price double,
> ts bigint
> ) using hudi
> tblproperties (
> type = 'mor',
> primaryKey = 'id',
> preCombineField = 'ts'
> )
> location 'hdfs:///hudi/hudi_mor_tbl'; {code}
> And then insert data via spark-shell and spark-sql respectively:
> {code:java}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SaveMode._
> import org.apache.hudi.DataSourceWriteOptions._
> import org.apache.hudi.config.HoodieWriteConfig._
> val fields = Array(
> StructField("id", IntegerType, true),
> StructField("name", StringType, true),
> StructField("price", DoubleType, true),
> StructField("ts", LongType, true)
> )
> val simpleSchema = StructType(fields)
> val data = Seq(Row(2, "a2", 200.0, 100L))
> val df = spark.createDataFrame(data, simpleSchema)
> df.write.format("hudi").
> option(PRECOMBINE_FIELD_OPT_KEY, "ts").
> option(RECORDKEY_FIELD_OPT_KEY, "id").
> option(TABLE_NAME, "hudi_mor_tbl").
> option(TABLE_TYPE_OPT_KEY, "MERGE_ON_READ").
> mode(Append).
> save("hdfs:///hudi/hudi_mor_tbl") {code}
> {code:java}
> insert into hudi_mor_tbl select 1, 'a1', 20, 1000; {code}
> After that, querying the table shows the two rows below:
> {code:java}
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| id|name|price|  ts|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
> |  20220902012710792|20220902012710792...|                 2|                      |c3eff8c8-fa47-48c...|  2|  a2|200.0| 100|
> |  20220902012813658|20220902012813658...|              id:1|                      |c3eff8c8-fa47-48c...|  1|  a1| 20.0|1000|
> +-------------------+--------------------+------------------+----------------------+--------------------+---+----+-----+----+
> {code}
> The '_hoodie_record_key' field for the spark-sql inserted row is 'id:1', while
> for the spark-shell inserted row it is '2'. It seems that spark-sql uses
> '[primaryKey_field_name]:[primaryKey_field_value]' to construct the
> '_hoodie_record_key' field, which is different from spark-shell.
> As a result, if we insert one row via spark-sql and then upsert it via
> spark-shell, we get two duplicated rows. That is not what we expect.
> Did I miss some configuration that might lead to this issue? If not, I
> personally think the default record key generation logic should be made
> consistent.
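To make the reported problem concrete, here is a minimal sketch in plain Python (not Hudi code; the two key formats are taken verbatim from the query output above) of why mismatched key-generation styles turn an upsert into a duplicate insert when records are matched on `_hoodie_record_key`:

```python
# Hypothetical model of the two observed key styles (not actual Hudi classes).
def simple_key(record, key_field="id"):
    # spark-shell style observed above: just the value, e.g. "1"
    return str(record[key_field])

def field_prefixed_key(record, key_field="id"):
    # spark-sql style observed above: "field:value", e.g. "id:1"
    return f"{key_field}:{record[key_field]}"

# Model the table as a dict keyed by _hoodie_record_key, so an upsert of an
# existing key overwrites the row and a new key adds a row.
table = {}

# Insert a row "via spark-sql" (field-prefixed key) ...
row = {"id": 1, "name": "a1", "price": 20.0, "ts": 1000}
table[field_prefixed_key(row)] = row            # key "id:1"

# ... then upsert the same logical record "via spark-shell" (simple key).
updated = {"id": 1, "name": "a1", "price": 20.0, "ts": 2000}
table[simple_key(updated)] = updated            # key "1" -- no match, new row

# The same logical record now exists twice instead of being updated once.
print(sorted(table))   # ['1', 'id:1']
print(len(table))      # 2
```

The sketch shows the failure mode, not the fix; the actual alignment of the two code paths is what HUDI-4765 tracks.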
--
This message was sent by Atlassian Jira
(v8.20.10#820010)