[GitHub] [hudi] zealjoanna opened a new issue, #9557: [SUPPORT] CDC file clean not work

via GitHub Mon, 28 Aug 2023 02:49:18 -0700


zealjoanna opened a new issue, #9557:
URL: https://github.com/apache/hudi/issues/9557


    **Describe the problem you faced**
   
   I'm using CDC reads on a COW table. I want to keep only the last two commits, 
so I set hoodie.cleaner.commits.retained = 1.
   The parquet files were cleaned as I expected,
   but the CDC log files were retained.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
           spark.sql("create table hudi_cow_pt_tbl (\n" +
               "  begin_lat double,\n" +
               "  begin_lon double,\n" +
               "  driver string,\n" +
               "  end_lat double,\n" +
               "  end_lon double,\n" +
               "  fare double,\n" +
               "  partitionpath String,\n" +
               "  rider string,\n" +
               "  ts String,\n" +
               "  uuid string)" +
               " using hudi\n" +
               "tblproperties (\n" +
               "hoodie.table.cdc.enabled =  'true',\n" +
               "hoodie.table.cdc.supplemental.logging = 'true',\n" +
               "hoodie.datasource.query.incremental.format = 'cdc',\n" +
               "type = 'cow',\n" +
               "primaryKey = 'uuid',\n" +
               "preCombineField = 'ts'\n " +
               ")\n" +
               "partitioned by (partitionpath)\n" +
               "location 'file:///D:/work/IDEA_PROJECT/spark_base/hudi_sql';")
           spark.sql("set  hoodie.clean.automatic = true;")
           spark.sql("set  hoodie.clean.max.commits = 1;")
           spark.sql("set  hoodie.cleaner.commits.retained = 1;")
           // The same row is inserted four times to produce four commits.
           spark.sql("insert into hudi_cow_pt_tbl " +
               "select 2.0 as begin_lat, 0.1 as begin_lon, '' as driver, 0.2 as end_lat, " +
               "0.3 as end_lon, 0.4 as fare, 'china/wuhan' as partitionpath, " +
               "'rider-1' as rider, 1691560140000 as ts, '1' as uuid")
           spark.sql("insert into hudi_cow_pt_tbl " +
               "select 2.0 as begin_lat, 0.1 as begin_lon, '' as driver, 0.2 as end_lat, " +
               "0.3 as end_lon, 0.4 as fare, 'china/wuhan' as partitionpath, " +
               "'rider-1' as rider, 1691560140000 as ts, '1' as uuid")
           spark.sql("insert into hudi_cow_pt_tbl " +
               "select 2.0 as begin_lat, 0.1 as begin_lon, '' as driver, 0.2 as end_lat, " +
               "0.3 as end_lon, 0.4 as fare, 'china/wuhan' as partitionpath, " +
               "'rider-1' as rider, 1691560140000 as ts, '1' as uuid")
           spark.sql("insert into hudi_cow_pt_tbl " +
               "select 2.0 as begin_lat, 0.1 as begin_lon, '' as driver, 0.2 as end_lat, " +
               "0.3 as end_lon, 0.4 as fare, 'china/wuhan' as partitionpath, " +
               "'rider-1' as rider, 1691560140000 as ts, '1' as uuid")
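
To check what the cleaner actually left on disk after these inserts, a small file count can help. This is only a minimal sketch, not part of Hudi: `CleanCheck`, `fileKind` and `countByKind` are hypothetical names, and the classification by the `.parquet` and `.cdc` suffixes is an assumption based on the file names shown later in this report.

```scala
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

// Hypothetical helper object (not a Hudi API).
object CleanCheck {
  // Classify a file by its name suffix; anything else (.crc, metadata) is "other".
  def fileKind(name: String): String =
    if (name.endsWith(".parquet")) "parquet"
    else if (name.endsWith(".cdc")) "cdc"
    else "other"

  // Walk the table directory recursively, skip everything under .hoodie,
  // and count the remaining regular files per kind.
  def countByKind(tablePath: Path): Map[String, Int] = {
    val stream = Files.walk(tablePath)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p))
        .filterNot(p => tablePath.relativize(p).iterator().asScala.exists(_.toString == ".hoodie"))
        .toList // force evaluation before the stream is closed
        .groupBy(p => fileKind(p.getFileName.toString))
        .map { case (kind, files) => kind -> files.size }
    } finally stream.close()
  }
}
```

Calling `CleanCheck.countByKind(java.nio.file.Paths.get("D:/work/IDEA_PROJECT/spark_base/hudi_sql"))` after each insert should show whether the `cdc` count keeps growing while the `parquet` count stays bounded.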
   
   **Expected behavior**
   
   I expect two CDC log files and two parquet files to remain.
   
   **Environment Description**
   
   * Hudi version :1.31.1
   
   * Spark version :3.3.2
   
   * Hive version : no 
   
   * Hadoop version :2.8.3
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :no
   
   
   **Additional context**
   
   
![image](https://github.com/apache/hudi/assets/21325163/3d59015a-8f8d-4c2d-98df-cfca56e14884)
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   Here is some information I found:
   the CDC file paths in the metadata do not look right.
    
   d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_0-15-11_20230825201900316.parquet -> [436825,false],
   d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_0-15-11_20230825203147367.parquet -> [436824,false],
   partitionpath=1/.d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_20230825201939193.log.1_0-15-11.cdc -> [1283,false],
   partitionpath=1/.d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_20230825203147367.log.1_0-15-11.cdc -> [1283,false],
   d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_0-15-11_20230825201939193.parquet -> [436821,false],
   partitionpath=1/.d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_20230825201900316.log.1_0-15-11.cdc -> [1283,false],
   partitionpath=1/.d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_20230825202620629.log.1_0-15-11.cdc -> [1283,false]
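
The difference between the two kinds of entries can be made explicit with a short check over the keys listed above. This is a rough sketch under my own assumptions: `MetadataKeyCheck` and its members are hypothetical names, the 17-digit number in each file name is taken to be the commit instant, and a `/` in a key is taken to mean it carries a partition prefix.

```scala
// Hypothetical helper object (not a Hudi API).
object MetadataKeyCheck {
  // Keys copied from the metadata listing (sizes and flags omitted).
  val keys: Seq[String] = Seq(
    "d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_0-15-11_20230825201900316.parquet",
    "d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_0-15-11_20230825203147367.parquet",
    "partitionpath=1/.d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_20230825201939193.log.1_0-15-11.cdc",
    "partitionpath=1/.d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_20230825203147367.log.1_0-15-11.cdc",
    "d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_0-15-11_20230825201939193.parquet",
    "partitionpath=1/.d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_20230825201900316.log.1_0-15-11.cdc",
    "partitionpath=1/.d5cec5e4-2cb0-42c2-bcd1-39f1fe400e12-0_20230825202620629.log.1_0-15-11.cdc"
  )

  // Assumption: the instant is the 17-digit run between "_" and the next ".".
  private val InstantPattern = """_(\d{17})\.""".r

  def instantOf(key: String): Option[String] =
    InstantPattern.findFirstMatchIn(key).map(_.group(1))

  // Assumption: a key containing "/" is stored with its partition path.
  def hasPartitionPrefix(key: String): Boolean = key.contains("/")

  // Split the keys into partition-prefixed and bare file names.
  val (prefixed, bare) = keys.partition(hasPartitionPrefix)

  // Instants that appear on files with the given suffix.
  def instants(suffix: String): Set[String] =
    keys.filter(_.endsWith(suffix)).flatMap(instantOf).toSet

  // Instants that still have a CDC log file but no parquet file.
  val cdcWithoutParquet: Set[String] = instants(".cdc") -- instants(".parquet")
}
```

In this listing every `.cdc` key is stored with the `partitionpath=1/` prefix while every `.parquet` key is a bare file name, which may be why the cleaner does not match the CDC files.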
   
   
   
   


