It should be:

groupBy -> always triggers repartitioning
groupByKey -> may trigger repartitioning (only if a key-changing
              operation such as map, selectKey, flatMap, or transform
              precedes it)

And there will not be two repartitioning topics. The groupBy/groupByKey
operation does the repartitioning, so in the aggregation step we know
that the data is already correctly partitioned, and no second
repartitioning topic is created.
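
To make the reasoning concrete, here is a minimal plain-Java sketch of why a
key change needs exactly one repartitioning step. It is a toy model, not
Kafka Streams code: `RepartitionSketch`, `Rec`, `partitionFor` and the other
names are made up for illustration, and the `hashCode`-based partitioner is an
assumption standing in for Kafka's real partitioner.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Toy model of key-based partitioning. NOT the Kafka Streams API.
class RepartitionSketch {

    /** A keyed record (hypothetical type, for illustration only). */
    static final class Rec {
        final String key;
        final String value;
        Rec(String key, String value) { this.key = key; this.value = value; }
    }

    // Assumption: hashCode-based partitioner, chosen for brevity.
    // Kafka's default producer partitioner hashes the serialized key with murmur2.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Writes records to partitions according to their current key. */
    static List<List<Rec>> partitionByKey(List<Rec> records, int numPartitions) {
        List<List<Rec>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());
        for (Rec r : records) partitions.get(partitionFor(r.key, numPartitions)).add(r);
        return partitions;
    }

    /** Changes keys without moving data, as a key-changing operation does. */
    static List<List<Rec>> rekeyInPlace(List<List<Rec>> partitions, Function<Rec, String> selector) {
        List<List<Rec>> out = new ArrayList<>();
        for (List<Rec> p : partitions) {
            List<Rec> q = new ArrayList<>();
            for (Rec r : p) q.add(new Rec(selector.apply(r), r.value));
            out.add(q);
        }
        return out;
    }

    /** Models the single repartitioning step: redistribute by the new key. */
    static List<List<Rec>> repartition(List<List<Rec>> partitions) {
        List<Rec> all = new ArrayList<>();
        for (List<Rec> p : partitions) all.addAll(p);
        return partitionByKey(all, partitions.size());
    }

    /** True if no key appears in more than one partition. */
    static boolean eachKeyInOnePartition(List<List<Rec>> partitions) {
        Set<String> seen = new HashSet<>();
        for (List<Rec> p : partitions) {
            Set<String> here = new HashSet<>();
            for (Rec r : p) here.add(r.key);
            for (String k : here) {
                if (!seen.add(k)) return false;
            }
        }
        return true;
    }

    static List<Rec> sampleInput() {
        List<Rec> in = new ArrayList<>();
        in.add(new Rec("a", "x"));
        in.add(new Rec("b", "x"));
        in.add(new Rec("c", "y"));
        in.add(new Rec("d", "y"));
        return in;
    }

    public static void main(String[] args) {
        // Key unchanged (the groupByKey case): data is already partitioned correctly.
        List<List<Rec>> ingested = partitionByKey(sampleInput(), 2);
        System.out.println("key unchanged, safe to aggregate: " + eachKeyInOnePartition(ingested));

        // Key changed (the groupBy case): equal keys are now spread across partitions.
        List<List<Rec>> rekeyed = rekeyInPlace(ingested, r -> r.value);
        System.out.println("key changed, safe to aggregate: " + eachKeyInOnePartition(rekeyed));

        // One repartitioning step fixes it; aggregation needs no second one.
        List<List<Rec>> shuffled = repartition(rekeyed);
        System.out.println("after one repartition, safe to aggregate: " + eachKeyInOnePartition(shuffled));
    }
}
```

In this model the aggregation would simply run per partition on `shuffled`;
nothing about it requires moving the data again.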



-Matthias

On 3/1/17 11:25 AM, Michael Noll wrote:
> FYI: The difference between `groupBy` (may trigger re-partitioning) vs.
> `groupByKey` (does not trigger re-partitioning) also applies to:
> 
> - `map` vs. `mapValues`
> - `flatMap` vs. `flatMapValues`
> 
> 
> 
> On Wed, Mar 1, 2017 at 8:15 PM, Damian Guy <damian....@gmail.com> wrote:
> 
>> If you use stream.groupByKey() then there will be no repartitioning as long
>> as there have been no key-changing operations preceding it, i.e., map,
>> selectKey, flatMap, transform. If you use stream.groupBy(...) then we see
>> it as a key-changing operation, hence we need to repartition the data.
>>
>> On Wed, 1 Mar 2017 at 18:59 Tianji Li <skyah...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> I wonder if it makes sense to give the option to disable auto
>>> repartitioning while doing groupBy.
>>>
>>> I understand with https://issues.apache.org/jira/browse/KAFKA-3561,
>>> an internal topic for repartition will be automatically created and synced
>>> to brokers, which is useful when aggregation keys are not the ones used
>>> when ingesting raw data.
>>>
>>> However, in my case, I have carefully partitioned the data when ingesting
>>> my raw topics. If I do groupBy followed by aggregation, there will be TWO
>>> changelog topics, one for groupBy and another for aggregation.
>>>
>>> Does it make sense to make the groupBy one configurable?
>>>
>>> Thanks
>>> Tianji
>>>
>>
> 
> 
> 
