dataproblems opened a new issue, #12783:
URL: https://github.com/apache/hudi/issues/12783
**Describe the problem you faced**
I created a Hudi table with the record level index and performed upsert
operations on it. The first time I ran an upsert, it read the record index
files, figured out which data files needed an update, and wrote them to S3.
The second time I ran an upsert on the same table, I saw the record index
folder being deleted and recreated under the metadata folder. My Hudi table is
quite large, and re-creating the entire record level index is too expensive to
do on every upsert.
**To Reproduce**
Steps to reproduce the behavior:
1. Create a Hudi table using insert mode with the record level index enabled
2. Perform an upsert
3. Perform another upsert
**Expected behavior**
I expect any number of upserts after the initial creation of the record
level index to update the index incrementally as required, not re-create the
whole index.
**Environment Description**
* Hudi version : 0.15.0
* Spark version : 3.4.1
* Hive version :
* Hadoop version : 3.3.6
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
**Additional context**
Config I used for upsert operation:
```
hoodie.embed.timeline.server -> false,
hoodie.parquet.small.file.limit -> 1073741824,
hoodie.metadata.record.index.enable -> true,
hoodie.datasource.write.precombine.field -> $timestampField,
hoodie.datasource.write.payload.class ->
org.apache.hudi.common.model.OverwriteWithLatestAvroPayload,
hoodie.metadata.index.column.stats.enable -> true,
hoodie.parquet.max.file.size -> 2147483648,
hoodie.metadata.enable -> true,
hoodie.index.type -> RECORD_INDEX,
hoodie.datasource.write.operation -> upsert,
hoodie.parquet.compression.codec -> snappy,
hoodie.datasource.write.recordkey.field -> $recordKeyField,
hoodie.table.name -> $tableName,
hoodie.datasource.write.table.type -> COPY_ON_WRITE,
hoodie.datasource.write.hive_style_partitioning -> true,
hoodie.write.markers.type -> DIRECT,
hoodie.populate.meta.fields -> true,
hoodie.datasource.write.keygenerator.class ->
org.apache.hudi.keygen.SimpleKeyGenerator,
hoodie.upsert.shuffle.parallelism -> 10000,
hoodie.datasource.write.partitionpath.field -> $partitionField
```
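For context, this is roughly how I pass these options to the Spark datasource write (a minimal sketch; `df`, `tablePath`, and the `$...` field variables are placeholders for my actual DataFrame, S3 path, and column names):

```scala
import org.apache.spark.sql.SaveMode

// Subset of the options above; the remaining entries are passed the same way.
val hudiOptions = Map(
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.operation" -> "upsert",
  "hoodie.datasource.write.recordkey.field" -> recordKeyField,
  "hoodie.datasource.write.partitionpath.field" -> partitionField,
  "hoodie.datasource.write.precombine.field" -> timestampField,
  "hoodie.metadata.enable" -> "true",
  "hoodie.metadata.record.index.enable" -> "true",
  "hoodie.index.type" -> "RECORD_INDEX"
)

df.write
  .format("hudi")
  .options(hudiOptions)
  .mode(SaveMode.Append)  // append mode so the upsert updates the existing table
  .save(tablePath)
```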
I do not see any errors, but it does not make sense that Hudi clears away
my index and recreates it on every upsert.