Matthias J. Sax created KAFKA-19963:
---------------------------------------

             Summary: Explain how to parallelize per topic with Kafka Streams
                 Key: KAFKA-19963
                 URL: https://issues.apache.org/jira/browse/KAFKA-19963
             Project: Kafka
          Issue Type: Improvement
          Components: docs, streams
            Reporter: Matthias J. Sax


We are regularly asked how to break a Kafka Streams program into more tasks 
for better parallelization. The typical pattern looks like this:
{code:java}
KStream input = builder.stream(<list-of-topics--or--pattern>);
KStream result = input.filter(...).map(...); // or any other logic
result.to("output-topic");{code}
The above program reads from multiple topics but creates a single 
sub-topology, so the number of tasks we get equals the maximum number of 
partitions across all input topics. However, there is no reason to funnel the 
data of partition X of every topic through a single task X.
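
To make the funneling concrete, here is a minimal sketch in plain Java that 
groups topic partitions by partition number, the way a single sub-topology 
hands partition X of every input topic to task X. It does not call Kafka 
Streams; the class name, method name, and topic names are hypothetical and 
only serve the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: models the grouping, does not use the Kafka Streams API.
final class SingleSubTopologyGrouping {

    // Groups topic partitions by partition number: with one sub-topology,
    // partition X of every input topic ends up in the same task X.
    static Map<Integer, List<String>> partitionsPerTask(Map<String, Integer> partitionCounts) {
        Map<Integer, List<String>> tasks = new TreeMap<>();
        for (Map.Entry<String, Integer> topic : partitionCounts.entrySet()) {
            for (int p = 0; p < topic.getValue(); p++) {
                tasks.computeIfAbsent(p, k -> new ArrayList<>()).add(topic.getKey() + "-" + p);
            }
        }
        return tasks;
    }

    public static void main(String[] args) {
        // Hypothetical input topics and partition counts.
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("topic-a", 2);
        counts.put("topic-b", 3);
        partitionsPerTask(counts)
            .forEach((task, parts) -> System.out.println("task " + task + " <- " + parts));
    }
}
```

With these assumed counts, three tasks exist in total (the larger of the two 
partition counts), and tasks 0 and 1 each process one partition from both topics.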

To break up the program, one can rewrite the topology to create multiple 
sub-topologies, allowing for independent tasks per topic:
{code:java}
List<String> topics = <list-of-topics>;
for (String topic : topics) {
    KStream input = builder.stream(topic);
    KStream result = input.filter(...).map(...); // or any other logic
    result.to("output-topic");
}{code}
The above program creates an independent sub-topology per input topic, each 
getting its own set of tasks.
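
As a rough illustration of the effect on parallelism, the following sketch 
compares the expected task counts of the two topology shapes. It is plain 
arithmetic, not a call into Kafka Streams: a single sub-topology yields as 
many tasks as the largest partition count among its input topics, while one 
sub-topology per topic yields one task per partition of every topic. The 
class name, method names, and partition counts are hypothetical.

```java
// Illustrative only: computes expected task counts, does not use the Kafka Streams API.
final class TaskCountSketch {

    // One sub-topology reading all topics: the task count is the largest
    // partition count among the input topics.
    static int tasksWithSingleSubTopology(int... partitionsPerTopic) {
        int max = 0;
        for (int partitions : partitionsPerTopic) {
            max = Math.max(max, partitions);
        }
        return max;
    }

    // One sub-topology per topic: every topic contributes one task per
    // partition, so the task count is the sum of the partition counts.
    static int tasksWithSubTopologyPerTopic(int... partitionsPerTopic) {
        int sum = 0;
        for (int partitions : partitionsPerTopic) {
            sum += partitions;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] partitions = {4, 8, 2}; // hypothetical partition counts of three input topics
        System.out.println("single sub-topology:    " + tasksWithSingleSubTopology(partitions));
        System.out.println("sub-topology per topic: " + tasksWithSubTopologyPerTopic(partitions));
    }
}
```

For the assumed counts 4, 8, and 2, the single sub-topology gives 8 tasks, 
whereas the per-topic rewrite gives 14 tasks that can be spread over more 
threads and instances.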

We should add this information to the docs, as this question comes up regularly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
