[GitHub] [hudi] waitingF commented on a diff in pull request #8376: [HUDI-6019] support split kafka source by count

via GitHub Sat, 08 Apr 2023 20:39:06 -0700


waitingF commented on code in PR #8376:
URL: https://github.com/apache/hudi/pull/8376#discussion_r1161192725



##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/config/KafkaSourceConfig.java:
##########
@@ -63,6 +63,14 @@ public class KafkaSourceConfig extends HoodieConfig {
       .defaultValue(5000000L)
       .withDocumentation("Maximum number of records obtained in each batch.");
 
+  public static final ConfigProperty<Long> MAX_EVENTS_PER_KAFKA_PARTITION = ConfigProperty

Review Comment:
   I considered using a configuration similar to
[`minPartitions`](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html)
in Spark Structured Streaming to control the concurrency, but that parameter
does not directly control the data volume of each partition. My original goal
for this feature was to cap the data volume of each partition in order to
reduce the time spent pulling data from Kafka. A per-partition cap is also
consistent with the existing config (**MAX_EVENTS_FROM_KAFKA_SOURCE**), which
controls the maximum data volume read from Kafka. That is why I chose a
configuration for the maximum data volume per partition.
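
   As a rough illustration of what a per-partition cap means, the sketch below cuts the offset range of one Kafka partition into chunks of at most `maxEvents` records. It is not code from this PR: `OffsetRangeSplitter`, `split`, and the `long[]` range representation are hypothetical names chosen for the example, and the real source may compute its ranges differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not taken from the PR.
final class OffsetRangeSplitter {

  private OffsetRangeSplitter() {
  }

  // Splits [fromOffset, untilOffset) into consecutive sub-ranges that each
  // hold at most maxEvents offsets. Each long[] is {from (inclusive), until (exclusive)}.
  static List<long[]> split(long fromOffset, long untilOffset, long maxEvents) {
    if (maxEvents <= 0) {
      throw new IllegalArgumentException("maxEvents must be positive");
    }
    List<long[]> ranges = new ArrayList<>();
    long start = fromOffset;
    while (start < untilOffset) {
      // Compare against the remaining count so that start + maxEvents cannot overflow.
      long remaining = untilOffset - start;
      long end = remaining <= maxEvents ? untilOffset : start + maxEvents;
      ranges.add(new long[] {start, end});
      start = end;
    }
    return ranges;
  }
}
```

   With a cap of 4, `OffsetRangeSplitter.split(0L, 10L, 4L)` yields the three ranges [0, 4), [4, 8), and [8, 10), so no single range covers more than 4 records. By contrast, a `minPartitions`-style setting fixes how many ranges there are rather than how large each one can be.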
   



