DamianoErnesti-Anritsu opened a new issue, #13996:
URL: https://github.com/apache/hudi/issues/13996
### Bug Description
**What happened:**
We have a table with:
- Partition stats
- Column stats
- Record Level Index
- Secondary Index
All of them seem to work fine under normal conditions.
We have noticed that when running the clustering table service followed by
the cleaner, we get the following:
- Partition stats, Column stats and Secondary Index all seem to survive the
process
- The Record Level Index has been dropped and cleaned
- Secondary Index metadata records now point to non-existent Record Level
Index records in the metadata table and are effectively orphaned
At this point, not only do read queries become extremely slow because of the
missing RLI, but they also seem to be much slower than queries on the same
table with no indexes built at all, possibly because the orphaned index
records bloat the metadata table.
Furthermore, rebuilding the RLI after this is riddled with errors.
With SQL:
```
scala> spark.sql("CREATE INDEX record_index ON fake_table_name (_hoodie_record_key)")
java.lang.IllegalArgumentException: Input columns should match configured record key columns: _hoodie_record_key
```
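As an aside, the record key columns this validation presumably compares against are recorded in `hoodie.properties`. A minimal sketch of reading them, using a stand-in temp file with illustrative values since the real file lives under the table base path:

```shell
# Illustrative only: the real file is <table base path>/.hoodie/hoodie.properties.
props=$(mktemp)
cat > "$props" <<'EOF'
hoodie.table.recordkey.fields=fake_key_column
hoodie.table.metadata.partitions=column_stats,partition_stats,files,secondary_index_fake_column
EOF

# Assumption: CREATE INDEX validation compares the given columns against the
# configured record key fields, rather than the _hoodie_record_key meta column.
grep '^hoodie.table.recordkey.fields=' "$props" | cut -d= -f2
```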
With `"hoodie.metadata.record.index.enable" -> "true"` set in the options of a
writer:
```
org.apache.hudi.exception.HoodieMetadataException: Bootstrap on record_index partition failed for /fake_organization/fake_product/hudi/warehouse/fake_db/fake_table_name/.hoodie/metadata
Listing the metadata path shows that the Record Level Index in fact has no
partition:
```
[email protected] [fake_user]
/fake_organization/fake_product/hudi/warehouse/fake_db/fake_table_name/.hoodie/metadata
Found 5 items
drwxr-xr-x - fake_user supergroup 0 2025-09-25 12:19
/fake_organization/fake_product/hudi/warehouse/fake_db/fake_table_name/.hoodie/metadata/.hoodie
drwxr-xr-x - fake_user supergroup 0 2025-09-25 14:54
/fake_organization/fake_product/hudi/warehouse/fake_db/fake_table_name/.hoodie/metadata/column_stats
drwxr-xr-x - fake_user supergroup 0 2025-09-25 14:54
/fake_organization/fake_product/hudi/warehouse/fake_db/fake_table_name/.hoodie/metadata/files
drwxr-xr-x - fake_user supergroup 0 2025-09-25 13:43
/fake_organization/fake_product/hudi/warehouse/fake_db/fake_table_name/.hoodie/metadata/partition_stats
drwxr-xr-x - fake_user supergroup 0 2025-09-25 13:08
/fake_organization/fake_product/hudi/warehouse/fake_db/fake_table_name/.hoodie/metadata/secondary_index_fake_column
```
A similar issue can be seen with a `cat` of `.hoodie/hoodie.properties`:
```
hoodie.table.metadata.partitions=column_stats,partition_stats,files,secondary_index_fake_column
```
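The same check can be scripted; a minimal sketch, again with a stand-in temp file mirroring the property value shown above:

```shell
# Stand-in for <table base path>/.hoodie/hoodie.properties after clustering + cleaning.
props=$(mktemp)
cat > "$props" <<'EOF'
hoodie.table.metadata.partitions=column_stats,partition_stats,files,secondary_index_fake_column
EOF

# record_index is absent from the metadata partitions list.
if grep -q 'record_index' "$props"; then
  echo "record_index present"
else
  echo "record_index missing"
fi
```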
Before the clustering + cleaning, indexing worked fine and the Record Level
Index was there.
The clustering was attempted in multiple ways (on different fresh copies of
the same data) and always with the above result.
With the table service run WITHOUT indexing properties:
```
hoodie.clustering.async.enabled=true
hoodie.clustering.inline=false
hoodie.clustering.async.max.commits=0
hoodie.clustering.plan.strategy.target.file.max.bytes=536870912
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.plan.strategy.max.bytes.per.group=536870912
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.sort.columns=fake_column_0,fake_column_1
hoodie.clustering.plan.strategy.partition.selected=2023/07/15/00,2023/07/15/01
```
With the table service run WITH indexing properties:
```
hoodie.clustering.async.enabled=true
hoodie.clustering.inline=false
hoodie.clustering.async.max.commits=0
hoodie.clustering.plan.strategy.target.file.max.bytes=536870912
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.plan.strategy.max.bytes.per.group=536870912
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.clustering.plan.strategy.sort.columns=fake_column_0,fake_column_1
hoodie.clustering.plan.strategy.partition.selected=2023/07/15/00,2023/07/15/01
hoodie.parquet.compression.codec=gzip
hoodie.metadata.enable=true
hoodie.metadata.index.async=true
hoodie.metadata.index.partition.stats.enable=true
hoodie.metadata.index.column.stats.enable=true
hoodie.metadata.index.column.stats.column.list=fake_column_0,fake_column_1,fake_column_2
**hoodie.metadata.record.index.enable=true**
hoodie.metadata.index.secondary.enable=true
```
Directly inline from a writer with these options:
```
val hudiWriteOptions: Map[String, String] = Map(
  "hoodie.table.name" -> "fake_table_name",
  "path" -> "/fake_organization/fake_product/hudi/warehouse/fake_db/fake_table_name",
  "hoodie.table.timeline.timezone" -> "UTC",
  "hoodie.parquet.max.file.size" -> "536870912",
  "hoodie.parquet.small.file.limit" -> "402653184",
  "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
  "hoodie.datasource.write.operation" -> "insert",
  "hoodie.auto.commit" -> "true",
  "hoodie.allow.empty.commit" -> "true",
  "hoodie.metadata.enable" -> "true",
  "hoodie.metadata.index.partition.stats.enable" -> "true",
  "hoodie.metadata.index.column.stats.enable" -> "true",
  "hoodie.metadata.index.column.stats.column.list" -> "fake_column_0,fake_column_1",
  "hoodie.metadata.record.index.enable" -> "true",
  "hoodie.metadata.index.secondary.enable" -> "true",
  "hoodie.metadata.index.async" -> "false",
  "hoodie.datasource.write.partitionpath.field" -> "partition",
  "hoodie.datasource.write.drop.partition.columns" -> "true",
  "hoodie.datasource.write.streaming.disable.compaction" -> "false",
  "hoodie.archive.automatic" -> "true",
  "hoodie.clean.automatic" -> "true",
  "hoodie.clean.commits.retained" -> "1",
  "hoodie.clean.fileversions.retained" -> "1",
  "hoodie.datasource.compaction.async.enable" -> "false",
  "hoodie.compact.inline" -> "true",
  "hoodie.log.compaction.inline" -> "true",
  "hoodie.embed.timeline.server.async" -> "true",
  "hoodie.clustering.plan.strategy.sort.columns" -> "fake_column_0,fake_column_1",
  "hoodie.clustering.async.enabled" -> "false",
  "hoodie.clustering.inline" -> "true",
  "hoodie.clustering.inline.max.commits" -> "1",
  "hoodie.clustering.plan.strategy.target.file.max.bytes" -> "536870912",
  "hoodie.clustering.plan.strategy.max.bytes.per.group" -> "536870912",
  "hoodie.clustering.max.parallelism" -> "20",
  "hoodie.clustering.plan.strategy.max.num.groups" -> "40",
  "hoodie.archive.async" -> "false",
  "hoodie.clean.async.enabled" -> "false",
  "hoodie.parquet.compression.codec" -> "zstd",
  "parquet.compression.codec.zstd.level" -> "22",
)
```
All of these attempts resulted in the same problem, with the Record Level
Index being dropped.
The cleaner was always run like this:
```
/opt/fake_organization/fake_product/installed/fake_app/lib/spark3.5/bin/spark-submit \
  --verbose \
  --master yarn \
  --deploy-mode client \
  --queue fake_queue \
  --conf spark.scheduler.allocation.file=hdfs:///apps/spark3.5/spark-fairscheduler.xml \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.yarn.jars=hdfs:///apps/spark3.5/jars/*.jar \
  --conf spark.driver.cores=2 \
  --conf spark.driver.memory=8g \
  --executor-memory 8G \
  --executor-cores 5 \
  --num-executors 4 \
  --class org.apache.hudi.utilities.HoodieCleaner \
  /var/opt/fake_organization/fake_product/tmp/hudi-utilities-slim-bundle_2.12-1.0.2.jar \
  --target-base-path /fake_organization/fake_product/hudi/warehouse/fake_db/fake_table_name \
  --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS \
  --hoodie-conf hoodie.cleaner.fileversions.retained=1 \
  --hoodie-conf hoodie.cleaner.parallelism=200
**What you expected:**
The Record Level Index to survive clustering + cleaning like the other index
types, either implicitly or by setting it explicitly when clustering.
**Steps to reproduce:**
1. Create Record Level Index
2. Cluster dataset normally
3. Run cleaner right after
### Environment
**Hudi version:** 1.0.2
**Query engine:** Spark
**Relevant configs:**
### Logs and Stack Trace
_No response_