Vinaykumar Bhat created HUDI-7580:
-------------------------------------
Summary: Inserting rows into partitioned table leads to data
sanity issues
Key: HUDI-7580
URL: https://issues.apache.org/jira/browse/HUDI-7580
Project: Apache Hudi
Issue Type: Bug
Affects Versions: 1.0.0-beta1
Reporter: Vinaykumar Bhat
Came across this behaviour of partitioned tables while debugging another
issue with the functional index. It seems that the column ordering gets
mixed up when inserting records into a Hudi table, so a subsequent query
returns wrong results. An example follows.
The following is a Scala test:
{code:java}
test("Test Create Functional Index") {
  if (HoodieSparkUtils.gteqSpark3_2) {
    withTempDir { tmp =>
      val tableType = "cow"
      val tableName = "rides"
      val basePath = s"${tmp.getCanonicalPath}/$tableName"
      spark.sql("set hoodie.metadata.enable=true")
      spark.sql(
        s"""
           |create table $tableName (
           | id int,
           | name string,
           | price int,
           | ts long
           |) using hudi
           | options (
           | primaryKey ='id',
           | type = '$tableType',
           | preCombineField = 'ts',
           | hoodie.metadata.record.index.enable = 'true',
           | hoodie.datasource.write.recordkey.field = 'id'
           | )
           | partitioned by(price)
           | location '$basePath'
       """.stripMargin)
      spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
      spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 200000)")
      spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 2000000000)")
      spark.sql(s"select id, name, price, ts from $tableName").show(false)
    }
  }
}
{code}
The query returns the following result (note how the price and ts columns are
swapped).
{code:java}
+---+----+----------+----+
|id |name|price     |ts  |
+---+----+----------+----+
|3  |a3  |2000000000|1000|
|2  |a2  |200000    |100 |
|1  |a1  |1000      |10  |
+---+----+----------+----+
{code}
Having the partition column as the last column in the schema does not cause this
problem. If the mixed-up columns are of incompatible datatypes, then the insert
fails with an error.
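The observed output would be consistent with the values being written by
position against a schema in which the partition column has been moved to the
end. This is only a guess at the cause, not something confirmed here. A minimal
plain-Scala sketch of that hypothesis follows; it does not call any Hudi or
Spark API, and ColumnOrderSketch, storedSchema, writePositionally and
writeByName are hypothetical names used only for illustration.

```scala
// Hypothetical model of the suspected mix-up; this is not Hudi or Spark code.
object ColumnOrderSketch {
  // Column list as written in the INSERT statements of the test.
  val insertColumns: Seq[String] = Seq("id", "name", "price", "ts")

  // Assumption: the stored schema lists partition columns after data columns.
  def storedSchema(declared: Seq[String], partitionCols: Set[String]): Seq[String] = {
    val (partition, data) = declared.partition(partitionCols.contains)
    data ++ partition
  }

  // Positional write: the i-th value lands in the i-th stored column,
  // ignoring the column names given in the INSERT.
  def writePositionally(schema: Seq[String], values: Seq[Any]): Map[String, Any] =
    schema.zip(values).toMap

  // Name-based write: each value lands in the column it was listed under.
  def writeByName(columns: Seq[String], values: Seq[Any]): Map[String, Any] =
    columns.zip(values).toMap

  def main(args: Array[String]): Unit = {
    val schema = storedSchema(Seq("id", "name", "price", "ts"), Set("price"))
    val row: Seq[Any] = Seq(1, "a1", 10, 1000)
    println(s"stored schema: $schema")
    println(s"positional:    ${writePositionally(schema, row)}")
    println(s"by name:       ${writeByName(insertColumns, row)}")
  }
}
```

Under this assumption, the first inserted row ends up with price = 1000 and
ts = 10, matching the query result above. When price is declared last, the
stored order equals the declared order, so a positional write and a name-based
write agree. That would also fit the note that the insert fails when the two
swapped columns have incompatible datatypes.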
--
This message was sent by Atlassian Jira
(v8.20.10#820010)