n3nash commented on issue #2083: URL: https://github.com/apache/hudi/issues/2083#issuecomment-694679559
@rnatarajan Thanks for sharing this information, this is helpful. Firstly, you seem to have 24 cores (4 executors × 6 cores each), which means you can get a parallelism of 24. So, for starters, you can try setting `hoodie.bulkinsert.shuffle.parallelism=24`. I need some more information beyond what you provided:

1) Right now you are able to ingest 15K rows/second with the current setup, but you want to achieve 20K rows/second, is that correct?
2) Are you using Spark Structured Streaming to ingest, or are you using the Spark datasource and running batch jobs? If it is Spark Structured Streaming, can you share screenshots of the read stages of the DAG, essentially the stages where Spark is reading from Kafka?
3) Which part of the entire DAG is taking the most time right now?
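For reference, the suggested value is just executor count times cores per executor; a minimal sketch of that arithmetic (the 4-executor, 6-core figures are taken from the sizing discussed above):

```python
# Cluster sizing from the thread: 4 executors, 6 cores each.
num_executors = 4
cores_per_executor = 6

# Total task slots available at once, i.e. a sensible shuffle parallelism.
parallelism = num_executors * cores_per_executor
print(parallelism)  # → 24
```

In a Spark datasource write, this value would be passed as `.option("hoodie.bulkinsert.shuffle.parallelism", "24")` alongside setting `hoodie.datasource.write.operation` to `bulk_insert`.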
