spyzzz edited a comment on issue #2193:
URL: https://github.com/apache/hudi/issues/2193#issuecomment-715125484


   For this POC use case I only had inserts, but in production I'll get 
updates too. Even if inserts represent 99% of transactions, I have to ensure 
the Hudi table never gets any duplicates for a key.
   
   To sum up the situation: I'll get multiple Kafka topics filled with 
Debezium CDC events (mainly from MySQL). It's Kafka 2.6 with log compaction, so 
at any time there is at least one row per key, meaning every MySQL table row is 
always available in these topics.
   
   The first synchronisation will be very heavy (that's what I'm testing with 
10M or 100M records), but once that first big synchronisation is done, I'll 
just have to catch the fresh CDC events from each topic (around 10,000 events 
per day there, mainly deletes/updates).
   
   Maybe I can run a first job with a specific configuration to do the big 
initial synchronisation, then switch to another job that handles only the CDC 
upsert/delete events; that's possible because I'm using Structured Streaming 
with checkpoints.
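   A minimal sketch of that two-job idea, assuming a hypothetical table named 
"orders" with record key "id" and ordering field "ts" (those names are 
illustrative, not from my actual schema). The initial load uses the 
`bulk_insert` operation, then the ongoing CDC job reuses the same table 
options but switches to `upsert` so later events update/dedupe by key:

   ```python
   # Hudi writer options for the heavy initial load (bulk_insert skips the
   # index lookup, so it's much faster for the first big synchronisation).
   # Table name, key field, and precombine field below are hypothetical.
   initial_load_opts = {
       "hoodie.table.name": "orders",
       "hoodie.datasource.write.recordkey.field": "id",    # primary key from MySQL
       "hoodie.datasource.write.precombine.field": "ts",   # keep latest event per key
       "hoodie.datasource.write.operation": "bulk_insert",
   }

   # Same table options for the follow-up CDC job; only the operation changes,
   # so upserts/deletes land on the same keys the initial load wrote.
   incremental_opts = dict(initial_load_opts)
   incremental_opts["hoodie.datasource.write.operation"] = "upsert"

   # In the Structured Streaming job (illustrative, needs a Spark session):
   # df.writeStream.format("hudi").options(**incremental_opts) \
   #   .option("checkpointLocation", "/tmp/checkpoints/orders") \
   #   .start("/data/hudi/orders")
   ```

   Because the checkpoint tracks Kafka offsets, the second job can pick up 
exactly where the bulk load stopped.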
   
   I tried an insert-only job, but since there are several micro-batches I 
can't guarantee the keys will be unique in the Hudi table. If I did one big 
batch with all the events and then deduplicated with Hudi it could work, but 
with a 300M+ record table I'm not sure a single huge batch is the right way to 
go.
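   If I stay with insert-only, one option I could try (a sketch, assuming the 
`hoodie.datasource.write.insert.drop.duplicates` config, which makes inserts 
drop records whose key already exists in the table) is:

   ```python
   # Insert-only variant: keep the cheap "insert" operation across micro-batches,
   # but ask Hudi to drop records whose key is already present, so duplicates
   # from earlier micro-batches are not written again.
   insert_only_opts = {
       "hoodie.datasource.write.operation": "insert",
       "hoodie.datasource.write.insert.drop.duplicates": "true",
   }
   ```

   This still pays an index lookup per batch, so it may not be cheaper than 
plain upsert; it would need benchmarking on my volumes.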
   
   Hope the situation is clear.
   
   
   -> With hoodie.parquet.small.file.limit=104857600 nothing changes: still 4 
files of 74.4 MB per 5M records.
   
   
   

