FYI: The difference between `groupBy` (always treated as a key-changing
operation, so it may trigger re-partitioning) and `groupByKey` (does not by
itself trigger re-partitioning) also applies to:

- `map` vs. `mapValues`
- `flatMap` vs. `flatMapValues`
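
To illustrate why the key-changing variants can force a re-partition while the
`*Values` variants cannot, here is a minimal sketch in plain Java. It does not
use the Kafka Streams API. The class `RepartitionSketch`, the two-partition
layout, and the `hashCode`-based `partitionFor` are assumptions made up for
this example (Kafka's default partitioner hashes the serialized key bytes
instead). The sketch only checks whether each record still sits in the
partition its key maps to after a transformation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustrative model only; not how Kafka Streams is implemented.
class RepartitionSketch {
    static final int NUM_PARTITIONS = 2; // assumed partition count

    // Hypothetical partitioner: hash of the key modulo the partition count.
    static int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), NUM_PARTITIONS);
    }

    // Distributes records so that each one lands in the partition of its key.
    static List<List<Map.Entry<String, String>>> partitionByKey(
            List<Map.Entry<String, String>> records) {
        List<List<Map.Entry<String, String>>> partitions = new ArrayList<>();
        for (int i = 0; i < NUM_PARTITIONS; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<String, String> r : records) {
            partitions.get(partitionFor(r.getKey())).add(r);
        }
        return partitions;
    }

    // Applies fn to every record without moving it to another partition,
    // which models a stateless step with no re-partitioning.
    static List<List<Map.Entry<String, String>>> transform(
            List<List<Map.Entry<String, String>>> partitions,
            UnaryOperator<Map.Entry<String, String>> fn) {
        List<List<Map.Entry<String, String>>> out = new ArrayList<>();
        for (List<Map.Entry<String, String>> partition : partitions) {
            List<Map.Entry<String, String>> mapped = new ArrayList<>();
            for (Map.Entry<String, String> r : partition) {
                mapped.add(fn.apply(r));
            }
            out.add(mapped);
        }
        return out;
    }

    // True if every record sits in the partition its current key maps to.
    static boolean isPartitionedByKey(List<List<Map.Entry<String, String>>> partitions) {
        for (int p = 0; p < partitions.size(); p++) {
            for (Map.Entry<String, String> r : partitions.get(p)) {
                if (partitionFor(r.getKey()) != p) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> input = List.of(
                Map.entry("a", "x"), Map.entry("b", "x"), Map.entry("c", "y"));
        List<List<Map.Entry<String, String>>> byKey = partitionByKey(input);

        // Like mapValues: the key is untouched, so the layout stays valid.
        List<List<Map.Entry<String, String>>> afterMapValues =
                transform(byKey, r -> Map.entry(r.getKey(), r.getValue().toUpperCase()));
        // Like map: the old value becomes the new key, so records may now be misplaced.
        List<List<Map.Entry<String, String>>> afterMap =
                transform(byKey, r -> Map.entry(r.getValue(), r.getKey()));

        System.out.println("after mapValues: " + isPartitionedByKey(afterMapValues)); // true
        System.out.println("after map:       " + isPartitionedByKey(afterMap));       // false
    }
}
```

In this toy layout the two records that end up with key "x" sit in different
partitions after the `map`-style step, so a per-key aggregation would first
have to shuffle them together. After the `mapValues`-style step no such
shuffle is needed.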



On Wed, Mar 1, 2017 at 8:15 PM, Damian Guy <damian....@gmail.com> wrote:

> If you use stream.groupByKey() then there will be no repartitioning as long
> as there have been no key changing operations preceding it, i.e., map,
> selectKey, flatMap, transform. If you use stream.groupBy(...) then we see
> it as a key changing operation, hence we need to repartition the data.
>
> On Wed, 1 Mar 2017 at 18:59 Tianji Li <skyah...@gmail.com> wrote:
>
> > Hi there,
> >
> > I wonder if it makes sense to give the option to disable auto
> > repartitioning while doing groupBy.
> >
> > I understand with https://issues.apache.org/jira/browse/KAFKA-3561,
> > an internal topic for repartition will be automatically created and
> > synced to brokers, which is useful when aggregation keys are not the
> > ones used when ingesting raw data.
> >
> > However, in my case, I have carefully partitioned the data when ingesting
> > my raw topics. If I do groupBy followed by aggregation, there will be TWO
> > changelog topics, one for groupBy and another for aggregation.
> >
> > Does it make sense to make the groupBy one configurable?
> >
> > Thanks
> > Tianji
> >
>



-- 
*Michael G. Noll*
Product Manager | Confluent
+1 650 453 5860 | @miguno <https://twitter.com/miguno>
Follow us: Twitter <https://twitter.com/ConfluentInc> | Blog
<http://www.confluent.io/blog>
