Hi, All:

        The Samza documentation says, "Each task consumes data from
one partition for each of the job’s input streams." Does that mean the
result will be wrong if the data a job needs to process together is not
all in one partition?

        Assume my Samza input data on the Kafka topic "input" is
partitioned by the default strategy, round robin, and the topic has five
partitions. My Samza job counts the messages in the "input" topic by each
message's primary key and writes the counts to the Kafka topic "output".

        So I think I need the following steps:
      1. read the data from the Kafka topic "input"
      2. reset the partition key to the "primary key" in Samza
      3. produce the messages back to a Kafka topic named "temp"
      4. read the "temp" topic in Samza
      5. count the messages per primary key in Samza
      6. write the counts to a Kafka topic named "output"
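
To make the steps above concrete, here is a minimal sketch in plain Python.
It is only a simulation under stated assumptions, not Samza or Kafka code:
the sample messages, the helper names (partition_for, repartition_by_key,
count_per_task) and the CRC32-based key hashing are all hypothetical
stand-ins chosen for illustration.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 5  # the five partitions from the question


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Deterministic key -> partition mapping (hypothetical stand-in for a
    key-hash partitioner; real partitioners may hash differently)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def repartition_by_key(input_partitions):
    """Steps 1-3: read every "input" partition and re-emit each message to
    the "temp" partition selected by its primary key."""
    temp = [[] for _ in range(NUM_PARTITIONS)]
    for partition in input_partitions:
        for key in partition:
            temp[partition_for(key)].append(key)
    return temp


def count_per_task(partitions):
    """Steps 4-6: one task per partition counts only its own messages and
    emits (key, count) records to "output"."""
    output = []
    for partition in partitions:
        output.extend(Counter(partition).items())
    return output


# Hypothetical sample data: primary keys spread round robin over "input".
messages = ["a", "b", "a", "c", "a", "b", "a", "c", "b", "a"]
input_partitions = [messages[i::NUM_PARTITIONS] for i in range(NUM_PARTITIONS)]

output = count_per_task(repartition_by_key(input_partitions))
```

Because every message with the same primary key lands in the same "temp"
partition, each key is counted by exactly one task, so "output" holds one
record per key.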

        If I just read the data from the Kafka topic "input", count it in
Samza, and write it to the topic "output", the result will not be correct:
each task only sees the messages in its own partition, so there might be
multiple messages with partial counts for the same "primary key" in the
"output" topic. Do I understand this correctly?
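
The concern can be illustrated with another small plain-Python simulation.
Again, this is a sketch under assumptions and not Samza or Kafka code: the
sample messages and variable names are hypothetical, and each "task" is
modelled as a loop iteration that counts a single partition.

```python
from collections import Counter

NUM_PARTITIONS = 5  # the five partitions from the question

# Hypothetical sample data: primary keys spread round robin over "input".
messages = ["a", "b", "a", "c", "a", "b", "a", "c", "b", "a"]
input_partitions = [messages[i::NUM_PARTITIONS] for i in range(NUM_PARTITIONS)]

# Each task counts only the messages in its own partition and writes its
# (key, count) records straight to "output", with no repartitioning step.
direct_output = []
for partition in input_partitions:
    direct_output.extend(Counter(partition).items())

# How many separate records "output" holds for each primary key.
records_per_key = Counter(key for key, _ in direct_output)

# The true totals are only recoverable by summing the partial counts.
totals = Counter()
for key, count in direct_output:
    totals[key] += count
```

With this sample data, records_per_key shows more than one record for each
key, and the correct per-key count only appears after the partial counts
are summed in totals.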

Sincerely,
Selina
