spyzzz edited a comment on issue #2193: URL: https://github.com/apache/hudi/issues/2193#issuecomment-715125484
For this POC use case I only had inserts, but in production I'll get updates too; even if inserts represent 99% of transactions, I have to make sure the Hudi table doesn't get any duplicate keys.

To sum up the situation: I'll get multiple Kafka topics filled by Debezium CDC events (mainly from MySQL). It's Kafka 2.6 with compaction, so at any time there is at least one row per key, meaning all the MySQL table rows are always available in these Kafka topics. The first synchronisation will be very heavy (that's what I'm trying to do with 10 or 100M records), but once the first big synchronisation is done, I'll just have to catch the freshly arrived CDC events from each topic (around 10,000 events per day, mainly deletes/updates). Maybe I can run a first job with a specific configuration to do the big initial synchronisation, and then switch to another job that handles only the CDC upsert/delete events; that's possible because I'm using Structured Streaming with checkpoints.

I tried an insert-only job, but it runs several micro-batches, so I can't be sure keys will stay unique in the Hudi table. If I do one big batch with all the events and then deduplicate with Hudi, that could work, but with tables of 300M+ records I'm not sure one huge batch is the right way to go. Hope the situation is clear.

-> With hoodie.parquet.small.file.limit=104857600, no change: still 4 files of 74.4 MB per 5M records.
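The two-phase approach described above (heavy initial sync, then a light CDC stream) could be sketched roughly as the two Hudi writer configurations below. This is only a sketch under assumptions: the option keys are standard `hoodie.datasource.write.*` options, but the table name, record key, and precombine field are hypothetical placeholders, and `bulk_insert` for the first load is a suggestion, not something from the original comment.

```python
# Sketch: Hudi writer options for a two-phase Debezium CDC pipeline.
# Phase 1: one heavy initial load of the compacted-topic snapshot
#          (bulk_insert skips per-record index lookups).
# Phase 2: an ongoing Structured Streaming job applying upserts/deletes,
#          resuming from its own checkpoint.

COMMON_OPTIONS = {
    "hoodie.table.name": "mysql_mirror",                  # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "id",      # hypothetical key column
    "hoodie.datasource.write.precombine.field": "ts_ms",  # Debezium event timestamp
    "hoodie.parquet.small.file.limit": "104857600",       # 100 MB, as tried above
}

# Phase 1: big initial synchronisation (10-100M records).
BULK_LOAD_OPTIONS = {
    **COMMON_OPTIONS,
    "hoodie.datasource.write.operation": "bulk_insert",
}

# Phase 2: incremental CDC stream (~10k upsert/delete events per day).
STREAMING_OPTIONS = {
    **COMMON_OPTIONS,
    "hoodie.datasource.write.operation": "upsert",
}

def options_for(phase: str) -> dict:
    """Pick the writer options for the given pipeline phase."""
    return BULK_LOAD_OPTIONS if phase == "initial" else STREAMING_OPTIONS
```

In a Spark job these would be passed as `df.write.format("hudi").options(**options_for("initial"))`, and switching from the initial job to the streaming one is safe precisely because the streaming query keeps its own checkpoint.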
