voonhous opened a new issue, #7634: URL: https://github.com/apache/hudi/issues/7634
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

# TLDR

Dropping a partition and then writing to it reuses the stale consistent hashing metadata, causing the newly written data to be ignored on subsequent reads.

# Detailed Assumption

A test case is included below to demonstrate this behaviour/edge case, followed by a pseudo-fix. Please correct me if I am wrong; this is what I assume is happening:

The `ALTER TABLE DROP PARTITION` DDL creates a `.replacecommit` file listing the filegroups to replace when the partition is dropped. When a filegroup is flagged for deletion, it is ignored when a read is performed. However, after writing to a dropped partition, tables with `CONSISTENT_HASHING` may insert rows into a filegroup that is still flagged for deletion, because the partition's old hashing metadata is reused.

An extreme case can be demonstrated by setting `hoodie.bucket.index.num.buckets = 1`, which initialises the Hudi table with a single bucket. All newly written rows then land in the replaced filegroup, so no data is returned when reading a partition that has been dropped and inserted into.

**To Reproduce**

1. Create a table with the `CONSISTENT_HASHING` index type
2. Drop partition A
3.
Insert into partition A

Steps to reproduce the behavior:

```scala
test("Test alter table Into CONSISTENT_HASHING table") {
  withTempDir { tmp =>
    val tableName = "test_mor_tab"
    // Create a partitioned table which will be initialised with 1 bucket.
    // This will force all data to be written to this bucket (if row count is small).
    spark.sql(
      s"""
         |create table $tableName (
         |  id int,
         |  name string,
         |  price double,
         |  ts long,
         |  dt string
         |) using hudi
         |options
         |(
         |  type = 'mor'
         |  ,primaryKey = 'id'
         |  ,hoodie.index.type = 'BUCKET'
         |  ,hoodie.index.bucket.engine = 'CONSISTENT_HASHING'
         |  ,hoodie.bucket.index.num.buckets = 1
         |  ,hoodie.bucket.index.min.num.buckets = 1
         |  ,hoodie.bucket.index.max.num.buckets = 32
         |  ,hoodie.storage.layout.partitioner.class = 'org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner'
         |)
         | tblproperties (primaryKey = 'id')
         | partitioned by (dt)
         | location '${tmp.getCanonicalPath}'
       """.stripMargin)

    // Note: Do not write the field alias; the partition field must be placed last.
    spark.sql(
      s"""
         | insert into $tableName values
         |  (1, 'a1', 10, 1000, "2021-01-05"),
         |  (2, 'a2', 20, 2000, "2021-01-06"),
         |  (3, 'a3', 30, 3000, "2021-01-07")
       """.stripMargin)

    checkAnswer(s"select id, name, price, ts, dt from $tableName")(
      Seq(1, "a1", 10.0, 1000, "2021-01-05"),
      Seq(2, "a2", 20.0, 2000, "2021-01-06"),
      Seq(3, "a3", 30.0, 3000, "2021-01-07")
    )

    spark.sql(s"alter table $tableName drop partition (dt='2021-01-05')")

    // Pseudo-fix: delete the stale consistent hashing metadata for the dropped
    // partition so fresh metadata is created on the next write.
    def applyFix(): Boolean = {
      import scala.reflect.io.Directory
      import java.io.File
      val dir = new Directory(new File(
        s"${tmp.getCanonicalPath}/.hoodie/.bucket_index/consistent_hashing_metadata/dt=2021-01-05"))
      dir.deleteRecursively()
    }
    // applyFix()

    spark.sql(
      s"""
         | insert into $tableName values
         |  (4, 'a4', 40, 4000, "2021-01-05")
       """.stripMargin)

    checkAnswer(s"select id, name, price, ts, dt from $tableName")(
      Seq(4, "a4", 40.0, 4000, "2021-01-05"),
      Seq(2, "a2", 20.0, 2000, "2021-01-06"),
      Seq(3, "a3", 30.0, 3000, "2021-01-07")
    )
  }
}
```

**NOTE1:** The test, as written, fails on the final `checkAnswer`: the row inserted after the partition drop is not returned.

**NOTE2:** Uncomment the `applyFix()` line to get the correct/expected behaviour.

**Expected behavior**

If a partition is written to after being dropped, the newly written data should be queryable.

**Environment Description**

* Hudi version : 0.13.0 (2d1dd2a8ab11feb025d021d94d5fd6f2bfa9c66f)
* Spark version : 3.1
* Hive version : N.A.
* Hadoop version : N.A.
* Storage (HDFS/S3/GCS..) : N.A.
* Running on Docker? (yes/no) : No
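To see why a single bucket makes the problem total rather than partial, here is a toy sketch of hash-based bucket routing. This is not Hudi's actual consistent hashing implementation; the hash function and record-key format are illustrative assumptions. The point is only that with `num.buckets = 1`, every record key necessarily maps to the one filegroup that the `.replacecommit` flagged for deletion.

```python
# Toy illustration only -- NOT Hudi's consistent hashing implementation.
# With a single bucket, every record key routes to bucket 0, i.e. the lone
# filegroup already flagged for deletion by the replacecommit.
import hashlib

def bucket_for(record_key: str, num_buckets: int) -> int:
    """Map a record key onto one of num_buckets buckets (illustrative hash)."""
    digest = int(hashlib.md5(record_key.encode("utf-8")).hexdigest(), 16)
    return digest % num_buckets

keys = ["id:1", "id:2", "id:3", "id:4"]
# With hoodie.bucket.index.num.buckets = 1, all keys collapse into bucket 0.
print({k: bucket_for(k, 1) for k in keys})
```

With more buckets, only the subset of keys that happens to hash into replaced filegroups would be silently lost, which is why the single-bucket configuration makes the bug deterministic and easy to reproduce.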
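The `applyFix` helper in the test above amounts to deleting the stale hashing metadata directory for the dropped partition so that the next insert initialises fresh metadata. A minimal standalone sketch of the same filesystem operation follows; the table path here is a temporary stand-in, not a real Hudi table, and the directory layout is copied from the test above:

```python
# Sketch of the pseudo-fix from the test above: remove the stale consistent
# hashing metadata for the dropped partition. The table path is a stand-in.
import os
import shutil
import tempfile

table_path = tempfile.mkdtemp()  # placeholder for the real table location
meta_dir = os.path.join(
    table_path, ".hoodie", ".bucket_index",
    "consistent_hashing_metadata", "dt=2021-01-05")

# Simulate the stale metadata left behind after the partition drop.
os.makedirs(meta_dir)

# The fix: delete it so the next insert creates fresh hashing metadata.
shutil.rmtree(meta_dir)
print(os.path.isdir(meta_dir))  # prints False
```

A proper fix inside Hudi would presumably invalidate or rewrite this metadata as part of the drop-partition replacecommit itself, rather than requiring a manual delete.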
