shubhambg95 opened a new issue, #7125:
URL: https://github.com/apache/hudi/issues/7125

   **Describe the problem you faced**
   
   We currently have a Spark Streaming job that writes data from a Kafka topic 
to a Hudi MOR table. We want to enable column stats indexing so that reads 
from this table can use data skipping. 
   
   To do this, we set ```hoodie.metadata.index.column.stats.enable = true```. 
With this setting enabled, we see a problem with the Hudi cleaner: clean 
operations never finish and remain in the ```clean.inflight``` state. 
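
One way to confirm which clean instants are stuck is to list the table's `.hoodie` timeline directory and look for cleans that have an inflight file but no completed file. The sketch below assumes the timeline file naming `<instant>.clean.requested`, `<instant>.clean.inflight`, and `<instant>.clean`; the helper name `stuck_clean_instants` and the instant times in the example are hypothetical, and the file names would come from a listing of the GCS path.

```python
import re

# Assumed timeline naming: "<instant>.clean", "<instant>.clean.requested",
# "<instant>.clean.inflight". Other actions (e.g. deltacommit) are ignored.
_CLEAN_FILE = re.compile(r"^(\d+)\.clean(?:\.(requested|inflight))?$")


def stuck_clean_instants(timeline_files):
    """Return clean instants that reached inflight but never completed."""
    states = {}
    for name in timeline_files:
        match = _CLEAN_FILE.match(name)
        if match:
            # A file with no state suffix marks the completed clean.
            state = match.group(2) or "completed"
            states.setdefault(match.group(1), set()).add(state)
    return sorted(
        instant
        for instant, seen in states.items()
        if "inflight" in seen and "completed" not in seen
    )


# Hypothetical listing of <table base path>/.hoodie
example = [
    "20221101101500000.clean.requested",
    "20221101101500000.clean.inflight",
    "20221101101500000.clean",
    "20221101113000000.clean.requested",
    "20221101113000000.clean.inflight",
    "20221101113000000.deltacommit",
]
print(stuck_clean_instants(example))  # ['20221101113000000']
```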
   
   The metadata table (MDT) tasks are also affected and fail with 
```java.lang.StringIndexOutOfBoundsException at 
getColumnStatsRecords(HoodieTableMetadataUtil.java:1158)```
   
   When we run without setting ```hoodie.metadata.index.column.stats.enable = 
true```, we do not see any of these issues.
   
   Hudi Configurations
   ```
   hoodie.datasource.write.table.type: "MERGE_ON_READ"
   hoodie.datasource.write.precombine.field: preCombineField
   hoodie.datasource.write.recordkey.field: recordKeyField
   hoodie.datasource.write.partitionpath.field: partitionPathField
   hoodie.table.name: tableName
   hoodie.index.type: "SIMPLE"
   hoodie.datasource.write.operation: "INSERT"
   hoodie.datasource.write.hive_style_partitioning: true
   hoodie.insert.shuffle.parallelism: 54
   hoodie.finalize.write.parallelism: 54
   hoodie.cleaner.commits.retained: 300
   hoodie.keep.min.commits: 325
   hoodie.keep.max.commits: 350
   hoodie.sql.insert.mode: "upsert"
   hoodie.parquet.compression.codec: "snappy"
   hoodie.metadata.index.column.stats.enable: true
   ```
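
For reference, the relevant settings can be sketched as plain option maps with a small sanity check. This is a minimal sketch, not the job's actual code: `check_options` is a hypothetical helper, the read-side options (`hoodie.metadata.enable`, `hoodie.enable.data.skipping`) are assumptions about what a reader would need for data skipping, and the ordering check reflects the assumption that Hudi expects `hoodie.cleaner.commits.retained` < `hoodie.keep.min.commits` < `hoodie.keep.max.commits`.

```python
# Write-side options taken from the configuration above (subset).
write_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.cleaner.commits.retained": 300,
    "hoodie.keep.min.commits": 325,
    "hoodie.keep.max.commits": 350,
    "hoodie.metadata.index.column.stats.enable": True,
}

# Assumed read-side options for data skipping (not part of the issue text).
read_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.enable.data.skipping": "true",
}


def check_options(opts):
    """Return a list of human-readable problems found in the options."""
    problems = []
    retained = int(opts["hoodie.cleaner.commits.retained"])
    keep_min = int(opts["hoodie.keep.min.commits"])
    keep_max = int(opts["hoodie.keep.max.commits"])
    # Archival should keep more commits than the cleaner retains.
    if not retained < keep_min:
        problems.append("hoodie.keep.min.commits must exceed cleaner.commits.retained")
    if not keep_min < keep_max:
        problems.append("hoodie.keep.max.commits must exceed keep.min.commits")
    # The column stats index is stored in the metadata table.
    col_stats = str(opts.get("hoodie.metadata.index.column.stats.enable", "false")).lower()
    metadata = str(opts.get("hoodie.metadata.enable", "true")).lower()
    if col_stats == "true" and metadata == "false":
        problems.append("column stats index requires hoodie.metadata.enable=true")
    return problems


print(check_options(write_options))  # []
```

With PySpark, maps like these would typically be passed through `.options(**write_options)` on the writer and `.options(**read_options)` on the reader. The configuration from this issue passes the check, so the retention ordering does not appear to be the cause of the stuck cleans.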
   
   **Environment Description**
   
   * Hudi version : 0.12.1
   
   * Spark version : 3.1.1
   
   * Hive version : 2.3.7
   
   * Hadoop version : 3.2.0
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : yes
   
   
   
   

