nsivabalan commented on issue #7839:
URL: https://github.com/apache/hudi/issues/7839#issuecomment-1420045535
Hey @MihawkZoro: I could not reproduce this on my end. Here are the steps I
followed.
1. Created a table via spark-sql
```
create table parquet_tbl1 using parquet location
'file:///tmp/tbl1/*.parquet';
drop table hudi_ctas_cow1;
create table hudi_ctas_cow1 using hudi location 'file:/tmp/hudi/hudi_tbl/'
options (
type = 'cow',
primaryKey = 'tpep_pickup_datetime',
preCombineField = 'tpep_dropoff_datetime'
)
partitioned by (date_col) as select * from parquet_tbl1;
```
2. Read data from the `date_col = '2019-08-10'` partition, counting records per VendorID.
```
select VendorId, count(*) from hudi_ctas_cow1 where date_col = '2019-08-10'
group by 1;
```
this returned
```
1	1914
2	3988
```
3. Issued deletes to records w/ VendorID = 1 for this specific partition.
```
delete from hudi_ctas_cow1 where date_col = '2019-08-10' and VendorID = 1;
```
Verified from ".hoodie", that a new commit has succeeded and it added one
new parquet file to 2019-08-10 partition.
```
ls -ltr /tmp/hudi/hudi_tbl/date_col=2019-08-10/
total 2192
-rw-r--r-- 1 nsb wheel 571011 Feb 6 17:19
f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_10-27-119_20230206171846307.parquet
-rw-r--r-- 1 nsb wheel 529348 Feb 6 17:24
f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_0-83-1538_20230206172355871.parquet
```
The 2nd parquet file was written by the delete operation.
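Note that both files in the listing belong to the same file group: a sketch of how to read the names, assuming Hudi's usual base-file layout `<fileId>_<writeToken>_<instantTime>.parquet` (the `parse_base_file` helper is hypothetical, for illustration only):

```python
# Sketch (assumption): split a Hudi base-file name into its parts, per the
# common <fileId>_<writeToken>_<instantTime>.parquet layout.
def parse_base_file(name: str) -> dict:
    stem = name[: -len(".parquet")]
    # rsplit keeps any underscores inside the file id intact
    file_id, write_token, instant_time = stem.rsplit("_", 2)
    return {"file_id": file_id, "write_token": write_token,
            "instant_time": instant_time}

first = parse_base_file(
    "f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_10-27-119_20230206171846307.parquet")
second = parse_base_file(
    "f5fa2a6c-8128-4591-9f27-94b5b7880a86-0_0-83-1538_20230206172355871.parquet")

# Same file group, so the later instant supersedes the earlier base file;
# snapshot queries read only the newer slice.
assert first["file_id"] == second["file_id"]
assert second["instant_time"] > first["instant_time"]
```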
4. Triggered clustering job.
Property file contents
```
cat /tmp/cluster.props
hoodie.datasource.write.recordkey.field=tpep_pickup_datetime
hoodie.datasource.write.partitionpath.field=date_col
hoodie.datasource.write.precombine.field=tpep_dropoff_datetime
hoodie.upsert.shuffle.parallelism=8
hoodie.insert.shuffle.parallelism=8
hoodie.delete.shuffle.parallelism=8
hoodie.bulkinsert.shuffle.parallelism=8
hoodie.clustering.plan.strategy.sort.columns=date_col,tpep_pickup_datetime
hoodie.clustering.execution.strategy.class=org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy
hoodie.parquet.small.file.limit=0
hoodie.clustering.inline=true
hoodie.clustering.inline.max.commits=1
hoodie.clustering.plan.strategy.target.file.max.bytes=1073741824
hoodie.clustering.plan.strategy.small.file.limit=629145600
hoodie.clustering.async.enabled=true
hoodie.clustering.async.max.commits=1
```
```
./bin/spark-submit --class org.apache.hudi.utilities.HoodieClusteringJob
~/Downloads/hudi-utilities-bundle_2.11-0.12.2.jar --props /tmp/cluster.props
--mode scheduleAndExecute --base-path /tmp/hudi/hudi_tbl/ --table-name
hudi_ctas_cow1 --spark-memory 4g
```
Verified from ".hoodie" that I could see replace commit and it has
succeeded.
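For anyone repeating the verification step above: a minimal sketch of scanning the `.hoodie` timeline for completed instants, assuming the common `<instantTime>.<action>` naming for completed timeline files (e.g. `20230206172355871.replacecommit`), where in-flight/requested states carry an extra suffix. Clustering shows up as a `replacecommit` action; the CTAS and delete show up as plain `commit` actions.

```python
import os

def completed_actions(hoodie_dir: str) -> dict:
    """Map instant time -> action for completed instants in the timeline.

    Assumption: completed instants are named <instantTime>.<action> with no
    trailing state suffix (requested/inflight files have one more dot part).
    """
    out = {}
    for name in os.listdir(hoodie_dir):
        parts = name.split(".")
        if len(parts) == 2 and parts[0].isdigit():
            out[parts[0]] = parts[1]
    return out

# e.g. completed_actions("/tmp/hudi/hudi_tbl/.hoodie") should include a
# "replacecommit" entry after the clustering job finishes.
```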
5. Re-launched spark-sql and queried the table.
```
refresh table hudi_ctas_cow1;
select VendorId, count(*) from hudi_ctas_cow1 where date_col = '2019-08-10'
group by 1;
```
output
```
2	3988
Time taken: 3.818 seconds, Fetched 1 row(s)
```