abhijeetkushe commented on issue #1737: URL: https://github.com/apache/hudi/issues/1737#issuecomment-691268965
@bvaradar We are using Hudi 0.5.2-incubating, deployed on EMR. Good point on the terminology; let me rephrase my question.

COW table with `'hoodie.cleaner.commits.retained': 1`:

1. I write 6,000 events using Spark-Hudi in **Append** mode with Hive sync turned on. I see 1 parquet file in S3 and can query all 6,000 events using Presto.
2. I write 100 more events using Spark in **Append** mode and see 2 parquet files: one appears to be an older version, while the other, larger and more recent, should contain 6,100 events.
3. I write 100 more events using Spark in **Append** mode and still see 2 parquet files: the file written in step 1 appears to have been cleaned up due to the `hoodie.cleaner.commits.retained: 1` setting, and the latest file contains 6,200 events.

We called this "compaction" because a plain Spark write in **Append** mode (without Hudi) would leave 2 parquet files, the first with 6,000 events and the second with 100.

So for COW, each Spark write leaves only 2 files, which is the desired behavior. For MOR, the latest version does seem to contain all the events, but I don't see the previous versions being cleaned up. Does that make it clearer? I can share more snapshots, but I wanted to make sure the terminology is correct first.
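To illustrate what I mean, here is a toy sketch (plain Python, not Hudi's actual cleaner code) of how I understand the COW behavior above: each commit rewrites the file group into a new file version, and a cleaner configured with `hoodie.cleaner.commits.retained: 1` keeps the latest version plus one retained older version, so at most 2 files survive. The `clean` helper and the event counts are only illustrative.

```python
# Toy model of COW file-version retention (illustrative only, not Hudi internals).

def clean(versions, commits_retained=1):
    """Keep the latest file version plus `commits_retained` older versions."""
    keep = commits_retained + 1
    return versions[-keep:]

# Simulate the three writes described above; each entry is the total event
# count carried by that file version after a COW commit rewrites the group.
versions = []
for total_events in (6000, 6100, 6200):
    versions.append(total_events)
    versions = clean(versions, commits_retained=1)

print(versions)  # after step 3: two files remain, latest has 6200 events
```

This matches the observation in steps 1-3: one file after the first write, then a steady state of two files where the oldest version gets cleaned on each subsequent commit.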
