nsivabalan commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1420045535
Hey @MihawkZoro: I could not reproduce this on my end. Here are the steps I
followed.
1. Created a table via spark-sql
```
create table parquet_tbl1 using parquet location
'file:///tmp/tbl1/*.parquet';
drop table hudi_ctas_cow1;
create table hudi_ctas_cow1 using hudi location 'file:/tmp/hudi/hudi_tbl/'
options (
type = 'cow',
primaryKey = 'tpep_pickup_datetime',
preCombineField = 'tpep_dropoff_datetime'
)
partitioned by (date_col) as select * from parquet_tbl1;
```
2. Read data from the `date_col = '2019-08-10'` partition, counting records per VendorID.
```
select VendorId, count(*) from hudi_ctas_cow1 where date_col = '2019-08-10'
group by 1;
```
this returned
```
1	1914
2	3988
```
3. Issued deletes to records w/ VendorID = 1 for this specific partition.
```
delete from hudi_ctas_cow1 where date_col = '2019-08-10' and VendorID = 1;
```
Verified from ".hoodie", that a new commit has succeeded and it added one
new parquet file to 2019-08-10 partition.
```
ls -ltr /tmp/hudi/hudi_tbl/date_col=2019-08-10/
total 2192
-rw-r--r-- 1 nsb wheel 571011 Feb 6 17:19
f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_10-27-119_20230206171846307.parquet
-rw-r--r-- 1 nsb wheel 529348 Feb 6 17:24
f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_0-83-1538_20230206172355871.parquet
```
The 2nd parquet file was written by the delete operation.
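Note that both files in the listing belong to the same file group: a sketch of how to read the names, assuming Hudi's usual base-file layout `<fileId>_<writeToken>_<instantTime>.parquet` (the `parse_base_file` helper is hypothetical, for illustration only):

```python
# Sketch (assumption): split a Hudi base-file name into its parts, per the
# common <fileId>_<writeToken>_<instantTime>.parquet layout.
def parse_base_file(name: str) -> dict:
    stem = name[: -len(".parquet")]
    # rsplit keeps any underscores inside the file id intact
    file_id, write_token, instant_time = stem.rsplit("_", 2)
    return {"file_id": file_id, "write_token": write_token,
            "instant_time": instant_time}

first = parse_base_file(
    "f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_10-27-119_20230206171846307.parquet")
second = parse_base_file(
    "f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_0-83-1538_20230206172355871.parquet")

# Same file group, so the later instant supersedes the earlier base file;
# snapshot queries read only the newer slice.
assert first["file_id"] == second["file_id"]
assert second["instant_time"] > first["instant_time"]
```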
4. Triggered clustering job.
Property file contents
```
cat /tmp/cluster.props
hoodie.datasource.write.recordkey.field=tpep_pickup_datetime
hoodie.datasource.write.partitionpath.field=date_col
hoodie.datasource.write.precombine.field=tpep_dropoff_datetime
hoodie.upsert.shuffle.parallelism=8
hoodie.insert.shuffle.parallelism=8
hoodie.delete.shuffle.parallelism=8
hoodie.bulkinsert.shuffle.parallelism=8
hoodie.clustering.plan.strategy.sort.columns=date_col,tpep_pickup_datetime
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.parquet.small.file.limit=0
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=1
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=1
```
```
./bin/spark-submit --class org.apache.hudi.utilities.HoodieClusteringJob
~/Downloads/hudi-utilities-bundle_2.11-0.12.2.jar --props /tmp/cluster.props
--mode scheduleAndExecute --base-path /tmp/hudi/hudi_tbl/ --table-name
hudi_ctas_cow1 --spark-memory 4g
```
Verified from ".hoodie" that I could see replace commit and it has
succeeded.
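For anyone repeating the verification step above: a minimal sketch of scanning the `.hoodie` timeline for completed instants, assuming the common `<instantTime>.<action>` naming for completed timeline files (e.g. `20230206172355871.replacecommit`), where in-flight/requested states carry an extra suffix. Clustering shows up as a `replacecommit` action; the CTAS and delete show up as plain `commit` actions.

```python
import os

def completed_actions(hoodie_dir: str) -> dict:
    """Map instant time -> action for completed instants in the timeline.

    Assumption: completed instants are named <instantTime>.<action> with no
    trailing state suffix (requested/inflight files have one more dot part).
    """
    out = {}
    for name in os.listdir(hoodie_dir):
        parts = name.split(".")
        if len(parts) == 2 and parts[0].isdigit():
            out[parts[0]] = parts[1]
    return out

# e.g. completed_actions("/tmp/hudi/hudi_tbl/.hoodie") should include a
# "replacecommit" entry after the clustering job finishes.
```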
5. Re-launched spark-sql and queried the table.
```
refresh table hudi_ctas_cow1;
select VendorId, count(*) from hudi_ctas_cow1 where date_col = '2019-08-10'
group by 1;
```
output
```
2	3988
Time taken: 3.818 seconds, Fetched 1 row(s)
```