Hi, Yan:
Thanks a lot for your answer.
Sincerely,
Selina
On Mon, Oct 26, 2015 at 8:03 PM, Yan Fang <[email protected]> wrote:
> Hi Selina,
>
>
> Your understanding is correct. Yes, you "need to consumer the original
> input and send it back to Kafka and reset the* Key to departmentName *and
> then consume it again
> to count in Samza" if you want to count the number of students in the same
> departmentName. This is a typical aggregation use case. Because after
> aggregating the students in the same department, you can do more than just
> "count". :)
>
>
> Cheers,
> Yan
>
>
> At 2015-10-25 06:12:50, "Selina Tech" <[email protected]> wrote:
> >Hi, Yan:
> >
> > Thanks a lot for your reply.
> >
> > You mentioned "if you give the msgs the same partition key", which
> >mean same partition key value or same partition key attribute name?
> >
> > I mentioned "primary key" as "key" at public
> >KeyedMessage(java.lang.String topic, K key, V message) or you can ignore
> >it. I explain it in another way below.
> >
> > If I need aggregate data, but the data are not in same partition,
> do
> >we need consumer the data, and put it back it to Kafka with with new key
> >and then consumer it again and aggregate it in Samza.
> >
> > For example, messages about student GPA information was send to
> >Kafka by* K key(String schoolName)*. The message looks like "name,
> >schoolName, departmentName, grade, GPA", and assuming I have 3
> >partitions, With my understanding, all student records in one school
> should
> >go to same partition.
> >
> > Right now I need to aggregate data for same department, no matter
> >which school. which mean all the same departmentName message will be in
> >three different partition. If I just count it in one samza job, will the
> >result correct? Do I need to consumer the original input and send it back
> >to Kafka and reset the* Key to departmentName *and then consume it again
> >to count in Samza?
> >
> > If I did not understand the partition and task of Samza, would you
> >like to correct me?
> >
> >Sincerely,
> >Selina
> >
> >On Sat, Oct 24, 2015 at 2:45 AM, Yan Fang <[email protected]> wrote:
> >
> >>
> >>
> >> Hi Selina,
> >>
> >>
> >> what do you mean by "primary key" here? Is it one of the partitions of
> >> "input" or something like "if one msg meets condition x, we think msg
> has
> >> the primary key"?
> >>
> >>
> >> If you just want to count the msgs, you can count in one Samza job and
> >> send the result to "output" topic. You can send to any partition of the
> >> "output" if you give the msgs the same partition key.
> >>
> >>
> >> Thanks,
> >> Yan
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> At 2015-10-22 08:30:15, "Selina Tech" <[email protected]> wrote:
> >> >Hi, All:
> >> >
> >> > In the Samza document, it mentioned "Each task consumes data
> from
> >> >one partition for each of the job’s input streams." Does it mean if the
> >> >data processing one job is not in one partition, the result will be
> wrong.
> >> >
> >> > Assuming my Samza input data on Kafka topic -- "input" is
> >> >partitioned by default -- round robin. And I have five partitions. If
> my
> >> >Samza job is to count messages by primary key of the message at "input"
> >> >topic, and then output it to kafka topic -- "output".
> >> >
> >> > So I need steps as below
> >> > 1. read data from Kafka topic "input"
> >> > 2. reset the partition key to "primary key" in Samza
> >> > 3. produce it back to Kafka topic named as "temp"
> >> > 4. read "temp" topic at Samza
> >> > 5. count it in Samza
> >> > 6. write it to Kafka topic named as "output"
> >> >
> >> > If I just read data from Kafka topic "input" and count it in
> Samza
> >> >and write it to topic "output". The result will not be correct because
> >> there
> >> >might have multiple messages for same "primary key" in "output"
> topic. Do
> >> >I understand it correctly?
> >> >
> >> >Sincerely,
> >> >Selina
> >>
>