spyzzz commented on issue #2193: URL: https://github.com/apache/hudi/issues/2193#issuecomment-716814642
Yes, you're right: the metadata I gave you was from the initial bootstraps. So yes, I have two different workloads, you summed them up pretty clearly. One exception: for (1), updates can still show up during the initial bootstrap (it wasn't the case for the files I gave you), which is why I can't use BULK_INSERT mode. With the few configuration tips you gave me, I'm able to keep a pretty linear time for each 5-million-row batch, so yes, it's better. I'm around 500K rows/min (I don't really know whether that's good or not) with a 6-partition Kafka topic.

For (2), in CDC mode, yes, I have a lot fewer rows, but I have to apply updates pretty often (let's say every 10 minutes). I tried this afternoon, and I'm able to handle a 10-minute micro-batch in about 2 minutes. And the Hudi output stays pretty acceptable: not too many files, and every 10 delta files Hudi creates a ~40-70 MB parquet file.
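For anyone landing on this thread, the behaviour described above maps to a handful of Hudi writer settings. This is only a minimal sketch of what such a configuration might look like, assuming a MERGE_ON_READ table with inline compaction; the numeric values are illustrative, not the ones from the actual job:

```properties
# UPSERT handles the updates that can arrive even during the initial
# bootstrap; bulk_insert would only be safe for pure-insert batches.
hoodie.datasource.write.operation=upsert
hoodie.datasource.write.table.type=MERGE_ON_READ

# Compact inline every 10 delta commits, matching the observed
# "every 10 delta files, Hudi creates a parquet file" pattern.
hoodie.compact.inline=true
hoodie.compact.inline.max.delta.commits=10

# Steer base file sizes toward the 40-70 MB range (bytes, illustrative).
hoodie.parquet.small.file.limit=41943040
hoodie.parquet.max.file.size=73400320
```

With inline compaction, the micro-batch commit pays the compaction cost every Nth commit, which is one reason to watch that the 10-minute CDC batches keep finishing well under the batch interval.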
