geserdugarov opened a new issue, #13241:
URL: https://github.com/apache/hudi/issues/13241
**Describe the problem you faced**
I found that `hoodie.properties` stores no information about the table's index
configuration. To check whether Hudi allows the index configuration to change
between writes, I ran a couple of upserts with different index settings. I chose
the simple bucket index and varied the number of buckets. Hudi accepted both
writes without error, and the result was a corrupted dataset.
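A minimal sketch of the check described above, assuming `hoodie.properties` is a plain `key=value` file. The sample content is illustrative only and is not copied from the reporter's table; `parse_properties` is a hypothetical helper.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Illustrative sample, NOT the actual file from the issue.
SAMPLE = """\
# illustrative hoodie.properties content
hoodie.table.name=index_persist
hoodie.table.type=COPY_ON_WRITE
hoodie.table.recordkey.fields=id
hoodie.table.precombine.field=dt
"""

props = parse_properties(SAMPLE)
# Collect any persisted index-related settings.
index_keys = sorted(
    k for k in props
    if k.startswith(("hoodie.index.", "hoodie.bucket.index."))
)
print(index_keys)
```

For a real table, the text would come from the `.hoodie/hoodie.properties` file under the table location. An empty list means a later writer has nothing to validate its index settings against.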
**To Reproduce**
1. Create a COW table (so that the contents of the parquet files are easy to
inspect) with the simple bucket index, and set the number of buckets to 1.
```python
spark.sql("CREATE TABLE index_persist ("
" id int,"
" dt string"
") USING HUDI "
"TBLPROPERTIES ("
" 'primaryKey' = 'id',"
" 'type' = 'cow',"
" 'preCombineField' = 'dt',"
" 'hoodie.index.type' = 'BUCKET',"
" 'hoodie.bucket.index.num.buckets' = '1'"
") LOCATION '" + tmp_dir_path + "';")
```
2. Upsert a batch of records into the table. Check that there is now exactly
one parquet file, containing all the records.
```sql
INSERT INTO index_persist VALUES (1, 0), (2, 0), (3, 0), (4, 0), (5, 0);
```
3. Increase the number of buckets, for instance to 2.
```sql
SET hoodie.bucket.index.num.buckets=2;
```
4. Upsert records with the same record keys as in step 2, but with different
values in the other column. Check that there are now 3 parquet files.
```sql
INSERT INTO index_persist VALUES (1, 100), (2, 100), (3, 100), (4, 100),
(5, 100);
```
5. Select all records from the table, and observe that it now contains
duplicated record keys and stale values.
```sql
SELECT * FROM index_persist ORDER BY id;
```
Results
> ('1', '', '00000000-0a6e-4a5d-af56-c804f7a69372-0_0-34-54_20250430181205708.parquet', 1, '100')
> ('2', '', '00000000-0a6e-4a5d-af56-c804f7a69372-0_0-34-54_20250430181205708.parquet', 2, '0')
> ('2', '', '00000001-62c8-49ba-9298-d0091a20f8e3-0_1-34-55_20250430181205708.parquet', 2, '100')
> ('3', '', '00000000-0a6e-4a5d-af56-c804f7a69372-0_0-34-54_20250430181205708.parquet', 3, '100')
> ('4', '', '00000000-0a6e-4a5d-af56-c804f7a69372-0_0-34-54_20250430181205708.parquet', 4, '0')
> ('4', '', '00000001-62c8-49ba-9298-d0091a20f8e3-0_1-34-55_20250430181205708.parquet', 4, '100')
> ('5', '', '00000000-0a6e-4a5d-af56-c804f7a69372-0_0-34-54_20250430181205708.parquet', 5, '100')
Corresponding script is available at:
https://github.com/geserdugarov/test-hudi-issues/blob/main/check-index-persistence/check-index-persistence.py
**Expected behavior**
Hudi should validate the index configuration on write, and reject writers
whose index settings differ from those the table was originally written with.
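A rough sketch of such a guard, assuming the index settings were persisted with the table (which, per this report, they currently are not). `validate_index_config`, `IndexConfigMismatch`, and `INDEX_KEYS` are hypothetical names and are not part of Hudi.

```python
# Settings that must not change between writers (assumed list).
INDEX_KEYS = ("hoodie.index.type", "hoodie.bucket.index.num.buckets")


class IndexConfigMismatch(ValueError):
    """Raised when a writer's index settings differ from the persisted ones."""


def validate_index_config(persisted: dict, writer: dict) -> None:
    """Compare writer settings against persisted ones; raise on conflict."""
    for key in INDEX_KEYS:
        if key in persisted and key in writer and persisted[key] != writer[key]:
            raise IndexConfigMismatch(
                f"{key}: table has {persisted[key]!r}, writer has {writer[key]!r}"
            )


table_cfg = {"hoodie.index.type": "BUCKET",
             "hoodie.bucket.index.num.buckets": "1"}
writer_cfg = {"hoodie.index.type": "BUCKET",
              "hoodie.bucket.index.num.buckets": "2"}

try:
    validate_index_config(table_cfg, writer_cfg)
    rejected = False
except IndexConfigMismatch as err:
    rejected = True
    print(f"write rejected: {err}")
```

In the scenario from this report, the second write (2 buckets against a table created with 1) would fail fast instead of producing duplicates.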
**Environment Description**
* Hudi version : master, commit f0fcbf6eaf39dfe79e2b27ff7d626b0a8c06bce0
* Spark version : 3.5.3
* Storage (HDFS/S3/GCS..) : local file system
* Running on Docker? (yes/no) : no