[GitHub] [hudi] rnatarajan commented on issue #2083: Kafka readStream performance slow [SUPPORT]

GitBox Mon, 14 Sep 2020 23:04:44 -0700


rnatarajan commented on issue #2083:
URL: https://github.com/apache/hudi/issues/2083#issuecomment-692486055



   Update on this: 
   
   We traced the bottleneck to the [countByKey](https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java#L73) call in WorkloadProfile.
   
   We were reading data from Kafka (spread across 20 partitions).
   We tested with hoodie.datasource.write.partitionpath.field set to "" and to "<somefield>".
   
   In both cases, the records read from all Kafka partitions for a batch were shuffled in order to perform countByKey.
   This caused a major throughput drop.
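
To illustrate why a per-key count touches every input partition, here is a minimal plain-Java sketch of the two steps a countByKey-style aggregation goes through: a local count inside each partition, followed by a merge of the partial counts for each key. This is not Spark or Hudi code. The class name, method names, and the date-like keys are hypothetical and chosen only for illustration. In Spark, the merge step is where data moves between executors.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: simulates a countByKey-style aggregation
// over in-memory "partitions" without using Spark.
public class CountByKeySketch {

    // Step 1: each input partition counts its own keys locally.
    static Map<String, Long> countWithinPartition(List<String> partitionKeys) {
        Map<String, Long> local = new HashMap<>();
        for (String key : partitionKeys) {
            local.merge(key, 1L, Long::sum);
        }
        return local;
    }

    // Step 2: partial counts for the same key are brought together and summed.
    // In a distributed engine this merge requires moving data between workers.
    static Map<String, Long> countByKey(List<List<String>> partitions) {
        Map<String, Long> merged = new HashMap<>();
        for (List<String> partition : partitions) {
            countWithinPartition(partition)
                    .forEach((key, count) -> merged.merge(key, count, Long::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two simulated input partitions holding made-up partition-path keys.
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("2020/09/14", "2020/09/14", "2020/09/15"),
                Arrays.asList("2020/09/14", "2020/09/15"));
        System.out.println(countByKey(partitions));
    }
}
```

Because every key can appear in every input partition, the merge step has to see partial results from all of them, regardless of how many distinct keys there are.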
   
   


