n3nash commented on issue #2083: URL: https://github.com/apache/hudi/issues/2083#issuecomment-694679559
@rnatarajan Thanks for sharing this information, this is helpful. Firstly, you seem to have 24 cores (4 executors × 6 cores each), which means you can get a parallelism of 24. So, for starters, you can try setting `hoodie.bulkinsert.shuffle.parallelism=24`. I need some more information beyond what you provided:

1) Right now you are able to ingest 15K rows/second with the current setup, but you want to achieve 20K rows/second, is that correct?
2) Are you using Spark Structured Streaming to ingest, or are you using the Spark datasource and running batch jobs? If it is Spark Structured Streaming, can you share screenshots of the read stages of the DAG, essentially the stages where Spark is reading from Kafka?
3) Which part of the entire DAG is taking the most time right now?
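For reference, the suggested value is just executor count times cores per executor; a minimal sketch of that arithmetic (the 4-executor, 6-core figures are taken from the sizing discussed above):

```python
# Cluster sizing from the thread: 4 executors, 6 cores each.
num_executors = 4
cores_per_executor = 6

# Total task slots available at once, i.e. a sensible shuffle parallelism.
parallelism = num_executors * cores_per_executor
print(parallelism)  # → 24
```

In a Spark datasource write, this value would be passed as `.option("hoodie.bulkinsert.shuffle.parallelism", "24")` alongside setting `hoodie.datasource.write.operation` to `bulk_insert`.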
