MihawkZoro commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1420111468

   > Hey @MihawkZoro : I could not reproduce on my end. Here are the steps I followed.
   > 
   > 1. Created a table via spark-sql
   > 
   > ```
   > create table parquet_tbl1 using parquet location 'file:///tmp/tbl1/*.parquet';
   > drop table hudi_ctas_cow1;
   > create table hudi_ctas_cow1 using hudi location 'file:/tmp/hudi/hudi_tbl/' options (
   >   type = 'cow',
   >   primaryKey = 'tpep_pickup_datetime',
   >   preCombineField = 'tpep_dropoff_datetime'
   >  )
   > partitioned by (date_col) as select * from parquet_tbl1;
   > ```
   > 
   > 2. Read data from one of the partitions, grouped by VendorId.
   > 
   > ```
   > select VendorId, count(*) from hudi_ctas_cow1 where date_col = '2019-08-10' group by 1;
   > ```
   > 
   > This returned: VendorId 1 → 1914 rows, VendorId 2 → 3988 rows.
   > 
   > 3. Issued deletes for records w/ VendorId = 1 in this specific partition.
   > 
   > ```
   > delete from hudi_ctas_cow1 where date_col = '2019-08-10' and VendorID = 1;
   > ```
   > 
   > Verified from ".hoodie" that a new commit succeeded and that it added one new parquet file to the 2019-08-10 partition.
   > 
   > ```
   > ls -ltr /tmp/hudi/hudi_tbl/date_col=2019-08-10/
   > total 2192
   > -rw-r--r--  1 nsb  wheel  571011 Feb  6 17:19 f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_10-27-119_20230206171846307.parquet
   > -rw-r--r--  1 nsb  wheel  529348 Feb  6 17:24 f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_0-83-1538_20230206172355871.parquet
   > ```
   > 
   > The 2nd parquet file was written by the delete operation.
   > 
   > 4. Triggered the clustering job.
   >    Property file contents:
   > 
   > ```
   > cat /tmp/cluster.props 
   > 
   > hoodie.datasource.write.recordkey.field=tpep_pickup_datetime
   > hoodie.datasource.write.partitionpath.field=date_col
   > hoodie.datasource.write.precombine.field=tpep_dropoff_datetime
   > 
   > hoodie.upsert.shuffle.parallelism=8
   > hoodie.insert.shuffle.parallelism=8
   > hoodie.delete.shuffle.parallelism=8
   > hoodie.bulkinsert.shuffle.parallelism=8
   > 
   > hoodie.clustering.plan.strategy.sort.columns=date_col,tpep_pickup_datetime
   > hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
   > 
   > hoodie.parquet.small.file.limit=0
   > hoodie.clustering.inline=true
   > hoodie.clustering.inline.max.commits=1
   > hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
   > hoodie.clustering.plan.strategy.small.file.limit=629145600
   > hoodie.clustering.async.enabled=true
   > hoodie.clustering.async.max.commits=1
   > ```
   > 
   > ```
   > ./bin/spark-submit --class org.apache.hudi.utilities.HoodieClusteringJob ~/Downloads/hudi-utilities-bundle_2.11-0.12.2.jar --props /tmp/cluster.props --mode scheduleAndExecute --base-path /tmp/hudi/hudi_tbl/ --table-name hudi_ctas_cow1 --spark-memory 4g
   > ```
   > 
   > Verified from ".hoodie" that I could see the replace commit and that it succeeded.
   > 
   > 5. Re-launched spark-sql and queried the table.
   > 
   > ```
   > refresh table hudi_ctas_cow1;
   > select VendorId, count(*) from hudi_ctas_cow1 where date_col = '2019-08-10' group by 1;
   > ```
   > 
   > output
   > 
   > ```
   > 2  3988
   > Time taken: 3.818 seconds, Fetched 1 row(s)
   > ```
   
   The table is MOR (merge-on-read).
   <img width="623" alt="image" src="https://user-images.githubusercontent.com/32875366/217133399-ba18e8be-4b75-4983-9a0d-58787906b222.png">
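
For reference, the table type can be confirmed from `hoodie.table.type` in `.hoodie/hoodie.properties` under the table base path. A minimal sketch; since the actual table isn't available here, a hypothetical sample properties file stands in for the real one:

```shell
# On a real table the check would be (base path from the repro above):
#   grep 'hoodie.table.type' /tmp/hudi/hudi_tbl/.hoodie/hoodie.properties
# Below, a hypothetical sample hoodie.properties stands in for the real file.
cat > /tmp/hoodie.properties.sample <<'EOF'
hoodie.table.name=hudi_ctas_cow1
hoodie.table.type=MERGE_ON_READ
EOF
grep 'hoodie.table.type' /tmp/hoodie.properties.sample
# prints: hoodie.table.type=MERGE_ON_READ  (a COW table shows COPY_ON_WRITE)
```

This matters here because the repro above was done on a COW table, while the reported behavior is on a MOR table.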

