rnatarajan commented on issue #2083: URL: https://github.com/apache/hudi/issues/2083#issuecomment-692486055
Update on this: we found the bottleneck to be [countByKey](https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java#L73).

We were reading data from Kafka (spread across 20 partitions). We tested with hoodie.datasource.write.partitionpath.field set to "" and to "<somefield>". In both cases, the records read from all Kafka partitions for a batch were shuffled in order to perform the countByKey. This caused a major throughput drop.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
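To illustrate why the partition path setting made no difference, here is a minimal plain-Python sketch of a countByKey-style aggregation. It is not Spark or Hudi code: `count_by_key`, `make_batch`, the `(partition path, location)` key shape, and the reducer count of 4 are all assumptions made for illustration. The sketch assumes counts are combined per input partition before being exchanged; the point it shows is that an exchange step across all input partitions happens whether the batch has one key or many.

```python
from collections import Counter


def make_batch(num_kafka_partitions, records_per_partition, partition_path_of):
    """Build a hypothetical batch: one list of (key, record) pairs per Kafka partition.

    The key is an assumed (partition path, location) pair; location is None
    here, standing in for records that have no known file location yet.
    """
    return [
        [((partition_path_of(i), None), f"rec-{p}-{i}")
         for i in range(records_per_partition)]
        for p in range(num_kafka_partitions)
    ]


def count_by_key(partitions, num_reducers):
    """Simulate a countByKey-style aggregation over partitioned records.

    Returns the final counts and the number of partial-count entries that
    had to cross the shuffle boundary.
    """
    # Map side: each input partition counts its own keys locally.
    local_counts = [Counter(key for key, _ in part) for part in partitions]

    # Shuffle: each partial count is routed to the reducer that owns its key,
    # so every input partition contributes to the exchange.
    reducers = [Counter() for _ in range(num_reducers)]
    shuffled_entries = 0
    for counts in local_counts:
        for key, n in counts.items():
            reducers[hash(key) % num_reducers][key] += n
            shuffled_entries += 1

    # Collect: merge the reducer outputs into a single map.
    total = Counter()
    for reducer in reducers:
        total.update(reducer)
    return dict(total), shuffled_entries


# Partition path "" gives a single key; a field-based path gives several keys.
single_key_batch = make_batch(20, 100, lambda i: "")
multi_key_batch = make_batch(20, 100, lambda i: f"p{i % 5}")

print(count_by_key(single_key_batch, 4))
print(count_by_key(multi_key_batch, 4)[1])
```

In both runs all 20 input partitions take part in the exchange, which is consistent with the observation above that changing hoodie.datasource.write.partitionpath.field did not avoid the shuffle.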
