MihawkZoro opened a new issue, #7839:
URL: https://github.com/apache/hudi/issues/7839
**Environment Description**
* Hudi version : 0.12.2
* Spark version : 3.2.2
* Hadoop version : 2.7.3
* Storage : hdfs
**Describe the problem you faced**
I have a Hudi table from which I deleted some records. After I ran clustering
on the table, the deleted records reappeared in the query results.
**To Reproduce**
1. I have a Hudi table called cluster_test and deleted some records from it:
```
delete from cluster_test where id in (2,8,11);
```
The result after the delete is:
<img width="877" alt="企业微信截图_e767a9a0-741c-4d83-b25b-bd1c747bf68a"
src="https://user-images.githubusercontent.com/32875366/216547022-cda0100d-0d17-4a79-83c5-c1558cfac593.png">
2. Then I submitted a clustering job:
```
spark-submit --class org.apache.hudi.utilities.HoodieClusteringJob \
  hudi-utilities-bundle_2.12-0.12.2.jar \
  --props file:///Users/qishuiqing/develop/hudi/clusteringjob.properties \
  --mode scheduleAndExecute \
  --base-path 'hdfs://localhost:9000/user/hive/warehouse/hudi.db/cluster_test' \
  --table-name cluster_test --parallelism 4 \
  --spark-memory 4g
```
The result after clustering is:
<img width="1131" alt="企业微信截图_f9e34400-113c-43e9-9e26-1d3b095b7752"
src="https://user-images.githubusercontent.com/32875366/216547899-c7b30c10-93a0-4810-b4f4-518a552feb8c.png">
3. Table structure:
```
col_name data_type comment
_hoodie_commit_time string
_hoodie_commit_seqno string
_hoodie_record_key string
_hoodie_partition_path string
_hoodie_file_name string
id int
name string
ts bigint
# Detailed Table Information
Database hudi
Table cluster_test
Created By Spark 3.2.2
Type EXTERNAL
Provider hudi
Table Properties [preCombineField=ts, primaryKey=id, type=mor]
Statistics 2173911 bytes
Location hdfs://localhost:9000/user/hive/warehouse/hudi.db/cluster_test
Serde Library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
```
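A query along the following lines may help confirm the symptom without relying on the screenshots. It is only a sketch: it assumes the same Spark SQL session used for the delete in step 1, and it uses only the table name, ids, and columns listed above. The expectation in the comments (no rows) is what correct delete semantics would produce, not output captured from this run.

```sql
-- Sketch: run once right after the delete and once after the clustering job.
-- If the deletes are honored, both runs should return no rows.
-- The _hoodie_* meta columns show which commit and file any reappearing row comes from.
select id, name, ts, _hoodie_commit_time, _hoodie_file_name
from cluster_test
where id in (2, 8, 11);
```

If rows come back only after clustering, comparing their `_hoodie_commit_time` with the clustering instant on the timeline would indicate whether the clustering commit rewrote them.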
**Conclusion**
This is a serious bug that needs to be fixed.