dataproblems opened a new issue, #12783:
URL: https://github.com/apache/hudi/issues/12783
**Describe the problem you faced**
I created a Hudi table with the record level index and performed upsert
operations on it. The first time I ran an upsert, it read the record index
files, figured out which data files needed an update, and wrote them to S3.
The second time I ran an upsert on the same table, I saw the record index
folder being deleted and recreated under the metadata folder. My Hudi table is
quite large, and re-creating the entire record level index is too expensive to
do on every upsert.
**To Reproduce**
Steps to reproduce the behavior:
1. Create a Hudi table using insert mode with the record level index enabled
2. Perform an upsert
3. Perform another upsert
**Expected behavior**
I expect any number of upserts after the initial creation of the record
level index to update the index incrementally as required, not re-create the
whole index.
**Environment Description**
* Hudi version : 0.15.0
* Spark version : 3.4.1
* Hive version :
* Hadoop version : 3.3.6
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Additional context**
Config I used for upsert operation:
```
hoodie.embed.timeline.server -> false,
hoodie.parquet.small.file.limit -> 1073741824,
hoodie.metadata.record.index.enable -> true,
hoodie.datasource.write.precombine.field -> $timestampField,
hoodie.datasource.write.payload.class ->
org.apache.hudi.common.model.OverwriteWithLatestAvroPayload,
hoodie.metadata.index.column.stats.enable -> true,
hoodie.parquet.max.file.size -> 2147483648,
hoodie.metadata.enable -> true,
hoodie.index.type -> RECORD_INDEX,
hoodie.datasource.write.operation -> upsert,
hoodie.parquet.compression.codec -> snappy,
hoodie.datasource.write.recordkey.field -> $recordKeyField,
hoodie.table.name -> $tableName,
hoodie.datasource.write.table.type -> COPY_ON_WRITE,
hoodie.datasource.write.hive_style_partitioning -> true,
hoodie.write.markers.type -> DIRECT,
hoodie.populate.meta.fields -> true,
hoodie.datasource.write.keygenerator.class ->
org.apache.hudi.keygen.SimpleKeyGenerator,
hoodie.upsert.shuffle.parallelism -> 10000,
hoodie.datasource.write.partitionpath.field -> $partitionField
```
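For context, this is roughly how I pass these options to the Spark datasource write (a minimal sketch; `df`, `tablePath`, and the `$...` field variables are placeholders for my actual DataFrame, S3 path, and column names):

```scala
import org.apache.spark.sql.SaveMode

// Subset of the options above; the remaining entries are passed the same way.
val hudiOptions = Map(
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.recordkey.field" -> recordKeyField,
  "hoodie.datasource.write.partitionpath.field" -> partitionField,
  "hoodie.datasource.write.precombine.field" -> timestampField,
  "hoodie.metadata.enable" -> "true",
  "hoodie.metadata.record.index.enable" -> "true",
  "hoodie.index.type" -> "RECORD_INDEX"
)

df.write
  .format("hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)  // append mode so the upsert updates the existing table
  .save(tablePath)
```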
I do not see any errors, but it does not make sense that Hudi clears away
my index and recreates it on every upsert.