Re: [PR] [GOBBLIN-1956]Make Kafka streaming pipeline be able to config the max poll records during runtime [gobblin]

via GitHub Wed, 15 Nov 2023 19:01:32 -0800


homatthew commented on code in PR #3827:
URL: https://github.com/apache/gobblin/pull/3827#discussion_r1395095423



##########
gobblin-modules/gobblin-kafka-common/src/main/java/org/apache/gobblin/source/extractor/extract/kafka/KafkaStreamingExtractor.java:
##########
@@ -214,6 +220,19 @@ public LongWatermark getLwm() {
 
   public KafkaStreamingExtractor(WorkUnitState state) {
     super(state);
+    this.topicPartitions = getTopicPartitionsFromWorkUnit(state);
+    Map<KafkaPartition, LongWatermark> topicPartitionWatermarks = getTopicPartitionWatermarks(this.topicPartitions);
+    if (this.maxAvgRecordSize > 0 ) {
+      long maxPollRecords =
+          state.getPropAsLong(MAX_KAFKA_BUFFER_SIZE_IN_BYTES, DEFAULT_MAX_KAFKA_BUFFER_SIZE_IN_BYTES) / maxAvgRecordSize;

Review Comment:
   Makes sense. I think 50MB seems like a good default. 
   
   Here's an example of how the throughput would look for some higher-volume
   topics. All topics under 5 KB would be able to do 1000 records per second
   (pageviewevent is only ~800 bytes and URE is ~3 MB).
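
   To make the arithmetic in the diff concrete, here is a minimal sketch of
   the buffer-size / average-record-size division. It assumes "50MB" means
   50 * 1024 * 1024 bytes; the class name, the method name, the fallback
   parameter, and the clamp to at least 1 record are hypothetical additions
   for illustration and are not part of the PR.

```java
public class MaxPollRecordsSketch {
    // Assumption: the 50MB default is interpreted as 50 MiB.
    public static final long DEFAULT_MAX_BUFFER_BYTES = 50L * 1024 * 1024;

    /**
     * Derives a max-poll-records value from a byte budget and an average
     * record size, mirroring the division in the diff above.
     *
     * @param maxBufferBytes   byte budget for one poll
     * @param maxAvgRecordSize average record size in bytes; values <= 0 mean "unknown"
     * @param fallback         value to use when the record size is unknown
     */
    public static long maxPollRecords(long maxBufferBytes, long maxAvgRecordSize, long fallback) {
        if (maxAvgRecordSize <= 0) {
            return fallback;
        }
        // Hypothetical safeguard: never return 0 when a record is larger than the budget.
        return Math.max(1L, maxBufferBytes / maxAvgRecordSize);
    }

    public static void main(String[] args) {
        long fallback = 500L; // Kafka's documented default for max.poll.records
        System.out.println(maxPollRecords(DEFAULT_MAX_BUFFER_BYTES, 800L, fallback));             // ~800-byte records
        System.out.println(maxPollRecords(DEFAULT_MAX_BUFFER_BYTES, 5L * 1024, fallback));        // 5 KB records
        System.out.println(maxPollRecords(DEFAULT_MAX_BUFFER_BYTES, 3L * 1024 * 1024, fallback)); // ~3 MB records
    }
}
```

   Under these assumptions the budget allows 65536 records per poll at
   ~800 bytes, 10240 at 5 KB, and 16 at ~3 MB, so the per-poll cap shrinks
   quickly as the average record size grows.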



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
