rnatarajan commented on issue #2083:
URL: https://github.com/apache/hudi/issues/2083#issuecomment-692486055


   Update on this: 
   
   Found the bottleneck: the
[countByKey](https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java#L73)
   call in `WorkloadProfile`.
   
   We were reading data from Kafka (spread across 20 partitions).
   We tested with `hoodie.datasource.write.partitionpath.field` set to "" and to "<somefield>".
   
   In both cases, the records read from Kafka across all partitions (for a batch) were shuffled while performing countByKey.
   This caused a major throughput drop.
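
To make the shape of the problem concrete, here is a rough plain-Python model of a count-by-key over a batch that arrives in 20 input partitions. It is only a sketch: it does not use Spark or Hudi, the record layout and the `region` field are hypothetical, and `make_batch` and `count_by_key` are names invented for this illustration. The point is that per-key totals can only be produced by combining results from every input partition, whether the partition path is "" or a real field.

```python
from collections import Counter

NUM_KAFKA_PARTITIONS = 20  # partition count mentioned in this comment


def make_batch(records_per_partition=5):
    """Build a hypothetical batch: one list of records per Kafka partition."""
    return [
        [{"id": f"{p}-{i}", "region": f"r{i % 3}"} for i in range(records_per_partition)]
        for p in range(NUM_KAFKA_PARTITIONS)
    ]


def count_by_key(batch, partition_path_field):
    """Count records per partition path across all input partitions."""
    # Step 1: each input partition counts its own records.
    local_counts = []
    for records in batch:
        counts = Counter()
        for rec in records:
            # An empty field name models partitionpath.field = "" (a single key).
            key = rec[partition_path_field] if partition_path_field else ""
            counts[key] += 1
        local_counts.append(counts)

    # Step 2: results for the same key are merged across partitions.
    # In a distributed engine this cross-partition step is where a shuffle occurs.
    total = Counter()
    for counts in local_counts:
        total.update(counts)
    return dict(total)


batch = make_batch()
unpartitioned = count_by_key(batch, "")       # models partitionpath.field = ""
by_region = count_by_key(batch, "region")     # models partitionpath.field = "<somefield>"
print(unpartitioned)
print(by_region)
```

In this model the cross-partition merge in step 2 happens for both settings, which matches what we saw: changing the partition path field did not avoid the shuffle.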
   
   



