Ryan Pifer created HUDI-1196:
--------------------------------

             Summary: Record being placed in incorrect partition during upsert 
on COW/MOR global indexed tables
                 Key: HUDI-1196
                 URL: https://issues.apache.org/jira/browse/HUDI-1196
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Ryan Pifer


When upserting a record in a global index table (global and hbase) where the 
batch has multiple versions of the record in different partitions, the record 
is deduplicated correctly but placed in the incorrect partition. This was with 
using "hoodie.bloom.update.partition.path=true" as well

 

Batch with multiple versions of a record in different partitions:

```

scala> val inputDF = spark.read.format("parquet").load(inputDataPath).show()

+--------+---------+----------------+-------------+-------------+               

|     wbn|    cs_ss|     action_date|           ad|   ad_updated|

+--------+---------+----------------+-------------+-------------+

|12345678|InTransit|1596716921000601|2020-08-06-12|2020-08-06-12|

|12345678|  Pending|1596716921000602|2020-08-06-12|2020-08-06-12|

|12345678|  Pending|1596716921000603|2020-08-06-13|2020-08-06-13|

+--------+---------+----------------+-------------+-------------+

```

 

Values when querying _rt and _ro tables:

```

scala> spark.sql("select * from gb_update_partition_1_ro").show()

+-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+

|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|
   _hoodie_file_name|     wbn|  cs_ss|     action_date|   ad_updated|           
ad|

+-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+

|     20200817220935|  20200817220935_0_1|          12345678|         
2020-08-06-12|4dddb6e8-87c4-4bd...|12345678|Pending|1596716921000603|2020-08-06-13|2020-08-06-12|

+-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+

  

scala> spark.sql("select * from gb_update_partition_1_rt").show()

+-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+

|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|
   _hoodie_file_name|     wbn|  cs_ss|     action_date|   ad_updated|           
ad|

+-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+

|     20200817221924|  20200817221924_0_1|          12345678|         
2020-08-06-12|4dddb6e8-87c4-4bd...|12345678|Pending|1596716921000603|2020-08-06-13|2020-08-06-12|

+-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+

 ```

 

We can see that record displays most current version of the data except the 
partition values are from the older versions

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to