abhijeetkushe commented on issue #1737: URL: https://github.com/apache/hudi/issues/1737#issuecomment-691268965
@bvaradar We are using Hudi 0.5.2-incubating, deployed on EMR. Good point on the terminology; let me rephrase my question.

COW table with `'hoodie.cleaner.commits.retained': 1`:

1. I write 6,000 events using Spark-Hudi in **Append** mode with Hive sync turned on. I see 1 parquet file in S3 and can query all 6,000 events using Presto.
2. I write 100 more events using Spark in **Append** mode and see 2 parquet files: one appears to be an older version, while the other, larger and more recent, should contain 6,100 events.
3. I write 100 more events using Spark in **Append** mode and still see 2 parquet files: the file written in step 1 appears to have been cleaned up due to the `hoodie.cleaner.commits.retained: 1` setting, and the latest file contains 6,200 events.

We called this "compaction" because a plain Spark write in **Append** mode (without Hudi) would leave 2 parquet files, the first with 6,000 events and the second with 100.

So for COW, each Spark write leaves only 2 files, which is the desired behavior. For MOR, the latest version does seem to contain all the events, but I don't see the previous versions being cleaned up. Does that make it clearer? I can share more snapshots, but I wanted to make sure the terminology is correct first.
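To illustrate what I mean, here is a toy sketch (plain Python, not Hudi's actual cleaner code) of how I understand the COW behavior above: each commit rewrites the file group into a new file version, and a cleaner configured with `hoodie.cleaner.commits.retained: 1` keeps the latest version plus one retained older version, so at most 2 files survive. The `clean` helper and the event counts are only illustrative.

```python
# Toy model of COW file-version retention (illustrative only, not Hudi internals).

def clean(versions, commits_retained=1):
    """Keep the latest file version plus `commits_retained` older versions."""
    keep = commits_retained + 1
    return versions[-keep:]

# Simulate the three writes described above; each entry is the total event
# count carried by that file version after a COW commit rewrites the group.
versions = []
for total_events in (6000, 6100, 6200):
    versions.append(total_events)
    versions = clean(versions, commits_retained=1)

print(versions)  # after step 3: two files remain, latest has 6200 events
```

This matches the observation in steps 1-3: one file after the first write, then a steady state of two files where the oldest version gets cleaned on each subsequent commit.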
