geserdugarov opened a new issue, #13241:
URL: https://github.com/apache/hudi/issues/13241

   **Describe the problem you faced**
   
   Initially, I found that the table's index configuration is not stored in `hoodie.properties`. So I tried to check whether changing the index configuration between writes is allowed, by running a couple of upserts with different index settings. I chose the simple bucket index and varied the number of buckets. Hudi let me write data successfully in this scenario, which resulted in a corrupted dataset.
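
   For context: the simple bucket index routes each record to a fixed file group by hashing its record key modulo the configured number of buckets. The sketch below is a simplified stand-in for Hudi's actual `BucketIdentifier` logic (the hash function here is an assumption, not Hudi's real one); it only illustrates why changing the bucket count re-routes existing keys:
   ```python
   def bucket_id(record_key: str, num_buckets: int) -> int:
       """Map a record key to a bucket (simplified; NOT Hudi's exact hash)."""
       # Java-String-hashCode-style hash, kept non-negative
       h = 0
       for ch in record_key:
           h = (31 * h + ord(ch)) & 0xFFFFFFFF
       return (h & 0x7FFFFFFF) % num_buckets

   keys = ["1", "2", "3", "4", "5"]

   # With 1 bucket, every key routes to bucket 0, i.e. a single file group.
   print([bucket_id(k, 1) for k in keys])  # → [0, 0, 0, 0, 0]

   # With 2 buckets, some keys are re-routed, so later upserts for those keys
   # land in a different file group than the one holding their original rows.
   print([bucket_id(k, 2) for k in keys])
   ```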
   
   **To Reproduce**
   
   1. Create a COW table (to make it easy to check what is inside the parquet files) with the simple bucket index, and set the number of buckets to 1.
       ```python
       spark.sql("CREATE TABLE index_persist ("
                 "  id int,"
                 "  dt string"
                 ") USING HUDI "
                 "TBLPROPERTIES ("
                 "  'primaryKey' = 'id',"
                 "  'type' = 'cow',"
                 "  'preCombineField' = 'dt',"
                 "  'hoodie.index.type' = 'BUCKET',"
                 "  'hoodie.bucket.index.num.buckets' = '1'"
                 ") LOCATION '" + tmp_dir_path + "';")
       ```
   2. Upsert a batch of records into the table. Check that there is now only 1 parquet file containing all records.
       ```sql
       INSERT INTO index_persist VALUES (1, 0), (2, 0), (3, 0), (4, 0), (5, 0);
       ```
   3. Increase the number of buckets, for instance, to 2.
       ```sql
       SET hoodie.bucket.index.num.buckets=2;
       ```
   4. Upsert records with the same record keys as in step 2, but with different values. Check that there are now 3 parquet files.
       ```sql
       INSERT INTO index_persist VALUES (1, 100), (2, 100), (3, 100), (4, 100), 
(5, 100);
       ```
   5. Select all records from the table, and check that the records are now in an inconsistent state.
       ```sql
       SELECT * FROM index_persist ORDER BY id;
       ```
   Results:
   ```text
   ('1', '', '00000000-0a6e-4a5d-af56-c804f7a69372-0_0-34-54_20250430181205708.parquet', 1, '100')
   ('2', '', '00000000-0a6e-4a5d-af56-c804f7a69372-0_0-34-54_20250430181205708.parquet', 2, '0')
   ('2', '', '00000001-62c8-49ba-9298-d0091a20f8e3-0_1-34-55_20250430181205708.parquet', 2, '100')
   ('3', '', '00000000-0a6e-4a5d-af56-c804f7a69372-0_0-34-54_20250430181205708.parquet', 3, '100')
   ('4', '', '00000000-0a6e-4a5d-af56-c804f7a69372-0_0-34-54_20250430181205708.parquet', 4, '0')
   ('4', '', '00000001-62c8-49ba-9298-d0091a20f8e3-0_1-34-55_20250430181205708.parquet', 4, '100')
   ('5', '', '00000000-0a6e-4a5d-af56-c804f7a69372-0_0-34-54_20250430181205708.parquet', 5, '100')
   ```
   Record keys 2 and 4 are now duplicated across two file groups (the stale value '0' next to the new value '100'), while keys 1, 3, and 5 were updated in place.
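
   The duplicates follow directly from the re-routing: the second upsert only looks for existing rows in the bucket each key *currently* hashes to, so keys whose bucket assignment changed are written as brand-new rows while their stale originals survive untouched. A toy model of this behavior (pure Python, hypothetical hash, not Hudi code):
   ```python
   from collections import defaultdict

   def jhash(key: str) -> int:
       # Java-String-hashCode-style hash (an assumption; Hudi hashes differently)
       h = 0
       for ch in key:
           h = (31 * h + ord(ch)) & 0xFFFFFFFF
       return h & 0x7FFFFFFF

   def upsert(file_groups, records, num_buckets):
       # An upsert only touches the bucket a key *currently* hashes to.
       for key, value in records:
           file_groups[jhash(key) % num_buckets][key] = value

   table = defaultdict(dict)  # bucket id -> {record key: value}, one "file group" each

   # First write: 1 bucket, all five rows land in file group 0.
   upsert(table, [(str(i), "0") for i in range(1, 6)], num_buckets=1)

   # Second write: 2 buckets. Keys that now hash to bucket 1 become NEW rows
   # there, while their stale originals remain in bucket 0.
   upsert(table, [(str(i), "100") for i in range(1, 6)], num_buckets=2)

   # A reader scanning all file groups sees duplicates for the re-routed keys.
   rows = [(key, value) for fg in table.values() for key, value in fg.items()]
   dup_keys = sorted({k for k, _ in rows if sum(1 for r, _ in rows if r == k) > 1})
   print(dup_keys)  # exactly which keys duplicate depends on the hash function
   ```
   In the real run above, keys 2 and 4 were the ones duplicated; the toy hash splits the keys differently, but the failure mode is the same.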
   
   
   Corresponding script is available at:
   
https://github.com/geserdugarov/test-hudi-issues/blob/main/check-index-persistence/check-index-persistence.py
   
   **Expected behavior**
   
   Hudi should persist the index configuration, validate it on every write, and reject writers whose index configuration conflicts with the table's.
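
   One possible shape for such a check (a hypothetical sketch; the function name, structure, and the set of guarded keys are assumptions, not an existing Hudi API): persist the index settings at table-creation time and compare them against each writer's config before committing:
   ```python
   # Hypothetical validation sketch; not actual Hudi code.
   # Guarded keys chosen for illustration from the repro above.
   PERSISTED_INDEX_KEYS = ("hoodie.index.type", "hoodie.bucket.index.num.buckets")

   def validate_index_config(table_props: dict, writer_props: dict) -> None:
       """Reject a write whose index config conflicts with the table's persisted one."""
       for key in PERSISTED_INDEX_KEYS:
           persisted = table_props.get(key)
           supplied = writer_props.get(key, persisted)  # absent key -> inherit
           if persisted is not None and supplied != persisted:
               raise ValueError(
                   f"Index config conflict for {key}: table has {persisted!r}, "
                   f"writer supplied {supplied!r}"
               )

   # A writer changing the bucket count would be rejected:
   table_props = {"hoodie.index.type": "BUCKET",
                  "hoodie.bucket.index.num.buckets": "1"}
   try:
       validate_index_config(table_props, {"hoodie.bucket.index.num.buckets": "2"})
   except ValueError as e:
       print(e)
   ```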
   
   **Environment Description**
   
   * Hudi version : master, commit f0fcbf6eaf39dfe79e2b27ff7d626b0a8c06bce0
   
   * Spark version : 3.5.3
   
   * Storage (HDFS/S3/GCS..) : local file system
   
   * Running on Docker? (yes/no) : no
   
   
   
   

