[
https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Y Ethan Guo closed HUDI-7580.
-----------------------------
Resolution: Invalid
> Inserting rows into partitioned table leads to data sanity issues
> -----------------------------------------------------------------
>
> Key: HUDI-7580
> URL: https://issues.apache.org/jira/browse/HUDI-7580
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 1.0.0-beta1, 0.14.1
> Reporter: Vinaykumar Bhat
> Assignee: Sagar Sumit
> Priority: Major
> Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
> Original Estimate: 4m
> Remaining Estimate: 4m
>
> I came across this behaviour of partitioned tables while debugging an
> unrelated issue with the functional index. It seems that the column ordering
> gets mixed up when records are inserted into a Hudi table, so a subsequent
> query returns wrong results. An example follows.
>
> The following is a Scala test:
> {code:java}
> test("Test Create Functional Index") {
>   if (HoodieSparkUtils.gteqSpark3_2) {
>     withTempDir { tmp =>
>       val tableType = "cow"
>       val tableName = "rides"
>       val basePath = s"${tmp.getCanonicalPath}/$tableName"
>       spark.sql("set hoodie.metadata.enable=true")
>       spark.sql(
>         s"""
>            |create table $tableName (
>            |  id int,
>            |  name string,
>            |  price int,
>            |  ts long
>            |) using hudi
>            | options (
>            |  primaryKey ='id',
>            |  type = '$tableType',
>            |  preCombineField = 'ts',
>            |  hoodie.metadata.record.index.enable = 'true',
>            |  hoodie.datasource.write.recordkey.field = 'id'
>            | )
>            | partitioned by(price)
>            | location '$basePath'
>        """.stripMargin)
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 200000)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 2000000000)")
>       spark.sql(s"select id, name, price, ts from $tableName").show(false)
>     }
>   }
> }
> {code}
>
> The query returns the following result (note how the values of the *price*
> and *ts* columns are swapped).
> {code:java}
> +---+----+----------+----+
> |id |name|price     |ts  |
> +---+----+----------+----+
> |3  |a3  |2000000000|1000|
> |2  |a2  |200000    |100 |
> |1  |a1  |1000      |10  |
> +---+----+----------+----+
> {code}
>
> Having the partition column as the last column in the schema does not cause
> this problem. If the mixed-up columns are of incompatible datatypes, then the
> insert fails with an error.
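
The swap reported above is consistent with values being matched to columns by position rather than by name. The following is a minimal sketch in plain Scala (no Spark or Hudi involved) that models this. It assumes, without confirmation from the report, that the stored schema places the partition column last; the object and method names (ColumnOrderSketch, byPosition, byName) are hypothetical and exist only for illustration.

```scala
// Illustrative model only: this is not Hudi or Spark code.
object ColumnOrderSketch {
  // Column order as written in the CREATE TABLE statement of the report.
  val declared: Seq[String] = Seq("id", "name", "price", "ts")

  // ASSUMPTION: the stored schema moves the partition column (price) to the end.
  val stored: Seq[String] = Seq("id", "name", "ts", "price")

  // Positional mapping: the i-th inserted value lands in the i-th stored column.
  def byPosition(values: Seq[Any]): Map[String, Any] =
    stored.zip(values).toMap

  // Name-based mapping: each value stays with the column it was declared for.
  def byName(values: Seq[Any]): Map[String, Any] = {
    val named = declared.zip(values).toMap
    stored.map(column => column -> named(column)).toMap
  }

  def main(args: Array[String]): Unit = {
    // First row of the report: (id, name, price, ts) = (1, 'a1', 10, 1000).
    val row = Seq(1, "a1", 10, 1000L)
    println(byPosition(row)) // price holds 1000 and ts holds 10, as in the reported output
    println(byName(row))     // price holds 10 and ts holds 1000, as intended
  }
}
```

Under this assumption, declaring the partition column last makes the declared and stored orders identical, so both mappings agree, which would match the observation that such a schema does not show the problem.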