Jing Zhang created HUDI-6364:
--------------------------------
Summary: InsertOverwrite operation on consistent hashing resulting
in wrong data
Key: HUDI-6364
URL: https://issues.apache.org/jira/browse/HUDI-6364
Project: Apache Hudi
Issue Type: Bug
Components: index
Reporter: Jing Zhang
{code:scala}
// Assumes $tableName is an existing Hudi table that uses the consistent
// hashing bucket index and is partitioned by dt.
// Step 1: initial insert of nine rows into partition dt = '2021-01-05'
spark.sql(
  s"""insert into $tableName values
     |(5, 'a', 35, 1000, '2021-01-05'),
     |(1, 'a', 31, 1000, '2021-01-05'),
     |(3, 'a', 33, 1000, '2021-01-05'),
     |(4, 'b', 16, 1000, '2021-01-05'),
     |(2, 'b', 18, 1000, '2021-01-05'),
     |(6, 'b', 17, 1000, '2021-01-05'),
     |(8, 'a', 21, 1000, '2021-01-05'),
     |(9, 'a', 22, 1000, '2021-01-05'),
     |(7, 'a', 23, 1000, '2021-01-05')
     |""".stripMargin)
// Step 2: insert overwrite static partition (replaces all rows in the partition)
spark.sql(
  s"""
     | insert overwrite table $tableName partition(dt = '2021-01-05')
     | select * from (select 13, 'a2', 12, 1000) limit 10
  """.stripMargin)
// Step 3: insert two more rows into the overwritten partition
spark.sql(
  s"""
     | insert into $tableName values
     | (5, 'a3', 35, 1000, '2021-01-05'),
     | (3, 'a3', 33, 1000, '2021-01-05')
  """.stripMargin)
{code}
After running the above case, we expect the snapshot query to return three rows:
(13, "a2", 12.0, 1000, "2021-01-05"), (5, "a3", 35, 1000, "2021-01-05"), (3,
"a3", 33, 1000, "2021-01-05").
But the actual result contains only (13, "a2", 12.0, 1000, "2021-01-05"); the
two rows inserted after the overwrite are missing.
The root cause is that after running insert overwrite on a table with a
consistent bucket index, the file groups recorded in consistent_hashing_metadata
no longer match the file groups on storage.
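
To make the mismatch easier to picture, here is a toy model in plain Scala. It does not use any Hudi API: the object, the file group ids, and the route function are all hypothetical, and it assumes (not confirmed in this report) that later writes are routed through the stale hashing metadata into file groups that the overwrite already replaced, so a snapshot read skips them.

```scala
// Toy model only: none of these names are Hudi APIs.
object StaleHashingMetadataSketch {
  final case class Record(id: Int, name: String)

  // File group ids the hashing metadata still points at
  // (assumed to be stale after the insert overwrite).
  val metadataFileGroups: Vector[String] = Vector("fg-old-0", "fg-old-1")

  // File group ids that are actually live on storage after the overwrite.
  val liveFileGroups: Set[String] = Set("fg-new-0")

  // Hypothetical routing: choose a file group from the metadata by record key.
  def route(id: Int): String =
    metadataFileGroups(Math.floorMod(id, metadataFileGroups.size))

  // Snapshot read: only records stored in live file groups are visible.
  def snapshot(storage: Map[String, Vector[Record]]): Vector[Record] =
    storage.filter { case (fg, _) => liveFileGroups.contains(fg) }
      .values.flatten.toVector

  def run(): Vector[Record] = {
    // State right after the insert overwrite: one row in a new file group.
    val afterOverwrite = Map("fg-new-0" -> Vector(Record(13, "a2")))
    // The follow-up insert is routed through the stale metadata.
    val laterRows = Vector(Record(5, "a3"), Record(3, "a3"))
    val afterInsert = laterRows.foldLeft(afterOverwrite) { (state, r) =>
      val fg = route(r.id)
      state.updated(fg, state.getOrElse(fg, Vector.empty) :+ r)
    }
    snapshot(afterInsert)
  }

  def main(args: Array[String]): Unit =
    println(run()) // prints Vector(Record(13,a2))
}
```

In this sketch the rows with keys 5 and 3 land in "fg-old-1", which is not in the live set, so only the overwritten row is visible. That is the same shape as the result reported above.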