MihawkZoro opened a new issue, #7839:
URL: https://github.com/apache/hudi/issues/7839

   **Environment Description**
   
   * Hudi version : 0.12.2
   * Spark version : 3.2.2
   * Hadoop version : 2.7.3
   * Storage : HDFS
   
   **Describe the problem you faced**
   
   I deleted some records from a Hudi table and then ran clustering on it. 
When I checked the result afterwards, the deleted records had reappeared.
   
   **To Reproduce**
   
   1. I have a Hudi table called `cluster_test` and delete some records from it:
     ```
     delete from cluster_test where id in (2,8,11);
     ```
   The result after the delete is:
   <img width="877" alt="企业微信截图_e767a9a0-741c-4d83-b25b-bd1c747bf68a" src="https://user-images.githubusercontent.com/32875366/216547022-cda0100d-0d17-4a79-83c5-c1558cfac593.png">
   
   
   2. Then I submit a clustering job:
   ```
   spark-submit --class org.apache.hudi.utilities.HoodieClusteringJob \
   hudi-utilities-bundle_2.12-0.12.2.jar \
   --props file:///Users/qishuiqing/develop/hudi/clusteringjob.properties \
   --mode scheduleAndExecute \
   --base-path 'hdfs://localhost:9000/user/hive/warehouse/hudi.db/cluster_test' \
   --table-name cluster_test --parallelism 4 \
   --spark-memory 4g
   ```
   The result after clustering is:
   <img width="1131" alt="企业微信截图_f9e34400-113c-43e9-9e26-1d3b095b7752" src="https://user-images.githubusercontent.com/32875366/216547899-c7b30c10-93a0-4810-b4f4-518a552feb8c.png">
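
   One way to confirm the symptom is to query the deleted keys directly after 
clustering. The query below is a minimal sketch and was not part of the original 
session. It assumes the table is reachable as `cluster_test` in the current 
Spark SQL database and uses only columns from the table's schema.
   ```
   -- Hypothetical verification query, run after the clustering job finishes.
   -- No rows returned: the delete is still honored.
   -- Any rows returned: the deleted keys have reappeared; the metadata
   -- columns show which commit and file they are being read from.
   select _hoodie_commit_time, _hoodie_file_name, id, name, ts
   from cluster_test
   where id in (2, 8, 11);
   ```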
   
   3. Table structure:
   ```
   col_name     data_type       comment
   _hoodie_commit_time  string
   _hoodie_commit_seqno string
   _hoodie_record_key   string
   _hoodie_partition_path       string
   _hoodie_file_name    string
   id                   int
   name                 string
   ts                   bigint
   
   # Detailed Table Information
   Database             hudi
   Table                cluster_test
   Created By           Spark 3.2.2
   Type                 EXTERNAL
   Provider             hudi
   Table Properties     [preCombineField=ts, primaryKey=id, type=mor]
   Statistics           2173911 bytes
   Location             hdfs://localhost:9000/user/hive/warehouse/hudi.db/cluster_test
   Serde Library        org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
   InputFormat          org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat
   OutputFormat         org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
   ```
   
   **Conclusion**
   Deleted records reappearing after clustering is a serious bug that needs to be fixed.

