Hi all,

I've got a rather naive question here. The spiritual goal of this KIP is to
reduce cross-rack traffic, which normally manifests as a higher AWS/Azure
bill, since cloud providers charge for cross-AZ traffic.

To generalize, most deployments have three racks, RF=3, and one replica for
each partition in each rack. Therefore, in the steady state (absent any
cluster anomalies such as broker failures), we can be pretty confident that
there is a replica of every partition (input, changelog, repartition, and
output topics) on the same rack as any given Streams instance.

Why not just let Sophie's High-Availability Task Assignor do its thing, and
then *set the preferred leader* of each partition to the replica in the same
rack/AZ as the Active Task (rough sketch below)? This would solve two
problems:

1. The current KIP can't make any improvements in the case where a Task has
three involved partitions (e.g. input, changelog, output) and the leader of
each partition is in a different rack. With this approach, we could get
pretty close to zero cross-AZ traffic in a healthy cluster.
2. The current proposal requires a lot of work to balance availability, data
movement, and cross-AZ traffic. My proposal involves no additional data
movement at all; just a reassignment of partition leadership.
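
Mechanically, I'd expect this to be scriptable with the plain AdminClient.
A rough sketch (topic, partition, and broker IDs are all made up; it
assumes we already know broker 3 sits in the Active Task's rack):

import java.util.*;
import org.apache.kafka.clients.admin.*;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

public class PreferLocalLeader {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Say the partition's current replica list is [1, 2, 3] and we
            // want broker 3 (same rack as the Active Task) to lead.
            TopicPartition tp = new TopicPartition("my-input-topic", 0);

            // The first replica in the list is the preferred leader, so
            // reorder the replicas to put broker 3 first. The replica set
            // is unchanged, so no data moves.
            admin.alterPartitionReassignments(Map.of(
                    tp, Optional.of(new NewPartitionReassignment(List.of(3, 1, 2)))))
                .all().get();

            // Then trigger a preferred leader election for the partition.
            admin.electLeaders(ElectionType.PREFERRED, Set.of(tp))
                .partitions().get();
        }
    }
}

Since only the replica ordering (and thus the preferred leader) changes,
this is pure metadata work; no bytes get copied between brokers.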

There are two arguments against this proposal. The biggest is that two
Streams apps could share the same topic, which would cause some bickering
over leadership. Secondly, some have observed that changing partition
leadership can trigger ProducerFencedExceptions under EOS, which forces a
state restoration.

Colt McNealy

*Founder, LittleHorse.dev*


On Thu, Jun 1, 2023 at 10:02 AM Hao Li <h...@confluent.io.invalid> wrote:

> Hi Bruno,
>
> Dropping config rack.aware.assignment.enabled and adding value NONE to
> the enum of possible values for config rack.aware.assignment.strategy
> sounds good to me.
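>
> For illustration, configuring it might then look something like this (a
> sketch; the config key and the NONE value are the ones proposed in this
> discussion, not a released API):
>
> final Properties props = new Properties();
> // "NONE" disables rack aware assignment; the other enum values
> // select one of the rack aware strategies from the KIP.
> props.put("rack.aware.assignment.strategy", "NONE");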
>
> On Thu, Jun 1, 2023 at 12:39 AM Bruno Cadonna <cado...@apache.org> wrote:
>
> > Hi Hao,
> >
> > Thanks for the updates!
> >
> > What do you think about dropping config rack.aware.assignment.enabled
> > and adding value NONE to the enum of possible values for config
> > rack.aware.assignment.strategy?
> >
> > Best,
> > Bruno
> >
> > On 31.05.23 23:31, Hao Li wrote:
> > > Hi all,
> > >
> > > I've updated the KIP based on the feedback. Major changes I made:
> > > 1. Add rack aware assignment to `StickyTaskAssignor`
> > > 2. Reject `Prefer reliability and then find optimal cost` option in
> > > standby task assignment.
> > >
> > >
> > > On Wed, May 31, 2023 at 12:09 PM Hao Li <h...@confluent.io> wrote:
> > >
> > >> Hi all,
> > >>
> > >> Thanks for the feedback! I will update the KIP accordingly.
> > >>
> > >> *For Sophie's comments:*
> > >>
> > >> 1 and 2. Good catch. Fixed these.
> > >>
> > >> 3 and 4: Yes. We can make this a public config and call out the
> > >> client.rack consumer config users need to set.
> > >>
> > >> 5. It's ideal to take the previous assignment in HAAssignor into
> > >> consideration when we compute our target assignment; the complication
> > >> is making sure the assignment can eventually converge and we don't do
> > >> probing rebalances infinitely. It's not only about storing the
> > >> previous assignment or getting it somehow. We can actually get the
> > >> previous assignment now, like we do in StickyAssignor, but the
> > >> previous assignment will change in each round of probing rebalance.
> > >> The proposal of adding some weight to make the rack aware assignment
> > >> lean towards the original HAA's target assignment adds stability in
> > >> some corner cases where the cross rack traffic cost is tied, but it's
> > >> not sticky. The bottom line is that it won't be worse than the
> > >> current HAA's stickiness.
> > >>
> > >> 6. I'm fine with changing the assignor config to public. Actually, I
> > >> think we can use the min-cost algorithm with StickyAssignor as well
> > >> to mitigate the problem in 5. So we can have one public config to
> > >> choose an assignor and one public config to enable the rack aware
> > >> assignment.
> > >>
> > >> *For Bruno's comments:*
> > >>
> > >> The proposal was to implement all the options and use configs to
> > >> choose between them at runtime. We can make those configs public as
> > >> suggested.
> > >> 1, 2, 3, 4, 5: Agree and will fix those.
> > >> 6: The subscription protocol is not changed.
> > >> 7: Yeah, let me fix the notation.
> > >> 8: It meant clients. In the figure, it maps to `c1_1`, `c1_2`,
> > >> `c1_3`, etc.
> > >> 9: I'm also ok with just optimizing reliability for standby tasks. Or
> > >> we could simply run the "balance reliability over cost" greedy
> > >> algorithm to see if any cost could be reduced.
> > >> 10: Makes sense. Will fix the wording.
> > >> 11: Makes sense. Will update the test part.
> > >>
> > >> *For Walker's comments:*
> > >> 1. Stability for HAA is an issue. See my comments on Sophie's
> > >> feedback 5 and 6. I think we could use the rack aware assignment for
> > >> StickyAssignor as well. HAA assignments are less sticky, and we can
> > >> only shoot for minimizing cross rack traffic eventually, once
> > >> everything is stable.
> > >> 2. Yeah. This is a good point and we can also turn it on for
> > >> StickyAssignor.
> > >>
> > >> Thanks,
> > >> Hao
> > >>
> > >>
> > >> On Tue, May 30, 2023 at 2:28 PM Sophie Blee-Goldman <
> > >> ableegold...@gmail.com> wrote:
> > >>
> > >>> Hey Hao, thanks for the KIP!
> > >>>
> > >>> 1. There's a typo in the "internal.rack.aware.assignment.strategry"
> > >>> config; this should be internal.rack.aware.assignment.strategy.
> > >>>
> > >>> 2.
> > >>>
> > >>>> For O(E^2 * (CU)) complexity, C and U can be viewed as constant.
> > >>>> Number of edges E is T * N where T is the number of clients and N
> > >>>> is the number of Tasks. This is because a task can be assigned to
> > >>>> any client so there will be an edge between every task and every
> > >>>> client. The total complexity would be O(T * N) if we want to be
> > >>>> more specific.
> > >>>
> > >>> I feel like I'm missing something here, but if E = T * N and the
> > >>> complexity is ~O(E^2), doesn't this make the total complexity order
> > >>> of O(T^2 * N^2)?
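> > >>>
> > >>> Writing out the arithmetic with the KIP's own constants (C = 1,
> > >>> U <= 3):
> > >>>
> > >>>   E = T * N
> > >>>   O(E^2 * C * U) = O(E^2) = O((T * N)^2) = O(T^2 * N^2)
> > >>>
> > >>> i.e. the square applies to E itself, not to a single factor of
> > >>> T * N.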
> > >>>
> > >>> 3.
> > >>>
> > >>>> Since 3.C.I and 3.C.II have different tradeoffs and work better in
> > >>>> different workloads etc, we could add an internal configuration to
> > >>>> choose one of them at runtime.
> > >>>
> > >>> Why only an internal configuration? Same goes for
> > >>> internal.rack.aware.assignment.standby.strategry (which also has the
> > >>> typo)
> > >>>
> > >>> 4.
> > >>>
> > >>>>   There are no changes in public interfaces.
> > >>>
> > >>> I think it would be good to explicitly call out that users can
> > >>> utilize this new feature by setting the ConsumerConfig's
> > >>> CLIENT_RACK_CONFIG, possibly with a brief example.
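> > >>>
> > >>> e.g. something like this (just a sketch -- the rack id is made up;
> > >>> client.rack should match the broker.rack of the co-located brokers):
> > >>>
> > >>> final Properties props = new Properties();
> > >>> props.put(
> > >>>     StreamsConfig.consumerPrefix(ConsumerConfig.CLIENT_RACK_CONFIG),
> > >>>     "use1-az1");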
> > >>>
> > >>> 5.
> > >>>
> > >>>> The idea is that if we always try to make it overlap as much with
> > >>>> HAAssignor’s target assignment, at least there’s a higher chance
> > >>>> that tasks won’t be shuffled a lot if the clients remain the same
> > >>>> across rebalances.
> > >>>>
> > >>> This line definitely gave me some pause -- if there was one major
> > >>> takeaway I had after KIP-441, one thing that most limited the
> > >>> feature's success, it was our assumption that clients are relatively
> > >>> stable across rebalances. This was mostly true at limited scale or
> > >>> for on-prem setups, but unsurprisingly broke down in cloud
> > >>> environments or larger clusters. Not only do clients naturally fall
> > >>> in and out of the group, autoscaling is becoming more and more of a
> > >>> thing.
> > >>>
> > >>> Lastly, and this is more easily solved but still worth calling out,
> > >>> an assignment is only deterministic as long as the client.id is
> > >>> persisted. Currently in Streams, we only write the process UUID to
> > >>> the state directory if there is one, ie if at least one persistent
> > >>> stateful task exists in the topology. This made sense in the context
> > >>> of KIP-441, which targeted heavily stateful deployments, but this
> > >>> KIP presumably intends to target more than just the persistent &
> > >>> stateful subset of applications. To make matters even worse,
> > >>> "persistent" is defined in a semantically inconsistent way
> > >>> throughout Streams.
> > >>>
> > >>> All this is to say, it may sound more complicated to remember the
> > >>> previous assignment, but (a) imo it only introduces a lot more
> > >>> complexity and shaky assumptions to continue down this path, and (b)
> > >>> we actually already do persist some amount of state, like the
> > >>> process UUID, and (c) it seems like this is the perfect opportunity
> > >>> to finally rid ourselves of the determinism constraint which has
> > >>> frankly caused more trouble and time lost in sum than it would have
> > >>> taken us to just write the HighAvailabilityTaskAssignor to consider
> > >>> the previous assignment from the start in KIP-441.
> > >>>
> > >>> 6.
> > >>>
> > >>>> StickyTaskAssignor users who would like to use rack aware
> > >>>> assignment should upgrade their Kafka Streams version to the
> > >>>> version in which HighAvailabilityTaskAssignor and rack awareness
> > >>>> assignment are available.
> > >>>
> > >>> Building off of the above, the HAAssignor hasn't worked out
> > >>> perfectly for everybody up until now, given that we are only adding
> > >>> complexity to it; on the flipside, I would hesitate to try and force
> > >>> everyone to use it if they want to upgrade. We added a "secret"
> > >>> backdoor internal config to allow users to set the task assignor
> > >>> back in KIP-441 for this reason. WDYT about bumping this to a public
> > >>> config on the side in this KIP?
> > >>>
> > >>>
> > >>> On Tue, May 23, 2023 at 11:46 AM Hao Li <h...@confluent.io.invalid>
> > >>> wrote:
> > >>>
> > >>>> Thanks John! Yeah. The ConvergenceTest looks very helpful. I will
> > >>>> add it to the test plan. I will also add tests to verify the new
> > >>>> optimizer will produce a balanced assignment which has no worse
> > >>>> cross AZ cost than the existing assignor.
> > >>>>
> > >>>> Hao
> > >>>>
> > >>>> On Mon, May 22, 2023 at 3:39 PM John Roesler <vvcep...@apache.org>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Hao,
> > >>>>>
> > >>>>> Thanks for the KIP!
> > >>>>>
> > >>>>> Overall, I think this is a great idea. I always wanted to circle
> > >>>>> back after the Smooth Scaling KIP to put a proper optimization
> > >>>>> algorithm into place. I think this has the promise to really
> > >>>>> improve the quality of the balanced assignments we produce.
> > >>>>>
> > >>>>> Thanks for providing the details about the MaxCut/MinFlow
> > >>>>> algorithm. It seems like a good choice to me, assuming we choose
> > >>>>> the right scaling factors for the weights we add to the graph.
> > >>>>> Unfortunately, I don't think there's a good way to see how easy or
> > >>>>> hard this is going to be until we actually implement it and test
> > >>>>> it.
> > >>>>>
> > >>>>> That leads to the only real piece of feedback I had on the KIP,
> > >>>>> which is the testing portion. You mentioned system/integration/unit
> > >>>>> tests, but there's not too much information about what those tests
> > >>>>> will do. I'd like to suggest that we invest in more simulation
> > >>>>> testing specifically, similar to what we did in
> > >>>>> https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/processor/internals/assignment/TaskAssignorConvergenceTest.java
> > >>>>> .
> > >>>>>
> > >>>>> In fact, it seems like we _could_ write the simulation up front,
> > >>>>> and then implement the algorithm in a dummy way and just see
> > >>>>> whether it passes the simulations or not, before actually
> > >>>>> integrating it with Kafka Streams.
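> > >>>>>
> > >>>>> Something of roughly this shape, say (all names are made up, and
> > >>>>> assign() starts out as a stand-in; the point is just that the
> > >>>>> harness can exist before the real algorithm does):
> > >>>>>
> > >>>>> // Repeatedly rebalance random clusters and assert we reach a
> > >>>>> // fixed point within a bounded number of probing rebalances.
> > >>>>> for (int trial = 0; trial < 1000; trial++) {
> > >>>>>     int numTasks = 1 + random.nextInt(100);
> > >>>>>     int numClients = 1 + random.nextInt(20);
> > >>>>>     Map<Integer, Integer> assignment = Map.of();
> > >>>>>     for (int i = 0; ; i++) {
> > >>>>>         Map<Integer, Integer> next =
> > >>>>>             assign(numTasks, numClients, assignment);
> > >>>>>         if (next.equals(assignment)) break;  // converged
> > >>>>>         assignment = next;
> > >>>>>         assertTrue("did not converge", i < numTasks + numClients);
> > >>>>>     }
> > >>>>> }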
> > >>>>>
> > >>>>> Basically, I'd be +1 on this KIP today, but I'd feel confident
> > >>>>> about it if we had a little more detail regarding how we are going
> > >>>>> to verify that the new optimizer is actually going to produce more
> > >>>>> optimal plans than the existing assignor we have today.
> > >>>>>
> > >>>>> Thanks again!
> > >>>>> -John
> > >>>>>
> > >>>>> On 2023/05/22 16:49:22 Hao Li wrote:
> > >>>>>> Hi Colt,
> > >>>>>>
> > >>>>>> Thanks for the comments.
> > >>>>>>
> > >>>>>>> and I struggle to see how the algorithm isn't at least O(N)
> > >>>>>>> where N is the number of Tasks...?
> > >>>>>>
> > >>>>>> For O(E^2 * (CU)) complexity, C and U can be viewed as constant.
> > >>>>>> Number of edges E is T * N where T is the number of clients and N
> > >>>>>> is the number of Tasks. This is because a task can be assigned to
> > >>>>>> any client so there will be an edge between every task and every
> > >>>>>> client. The total complexity would be O(T * N) if we want to be
> > >>>>>> more specific.
> > >>>>>>
> > >>>>>>> But if the leaders for each partition are spread across multiple
> > >>>>>>> zones, how will you handle that?
> > >>>>>>
> > >>>>>> This is what the min-cost flow solution is trying to solve, i.e.,
> > >>>>>> find an assignment of tasks to clients where cross-AZ traffic can
> > >>>>>> be minimized. But there are some constraints to the solution, and
> > >>>>>> one of them is that we need to balance task assignment first (
> > >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams#KIP925:RackawaretaskassignmentinKafkaStreams-Designforrackawareassignment
> > >>>>>> ).
> > >>>>>> So in your example of a task whose three partitions are in the
> > >>>>>> same AZ as a client: if there are other clients, we still want to
> > >>>>>> balance the tasks across clients, even if putting all tasks on a
> > >>>>>> single client could result in 0 cross-AZ traffic. In the Algorithm
> > >>>>>> section (
> > >>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams#KIP925:RackawaretaskassignmentinKafkaStreams-Algorithm
> > >>>>>> ), the algorithm will try to find a min-cost solution based on a
> > >>>>>> balanced assignment instead of pure min-cost.
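> > >>>>>>
> > >>>>>> As a toy illustration of the graph we build (the names and cost
> > >>>>>> function here are simplified; the real proposal layers the
> > >>>>>> balancing constraints on top as capacities):
> > >>>>>>
> > >>>>>> // One edge per (task, client) pair; the edge cost counts the
> > >>>>>> // task's partitions whose leader rack differs from the client's
> > >>>>>> // rack. Min-cost flow then picks edges subject to per-client
> > >>>>>> // capacities that enforce the balanced assignment.
> > >>>>>> Map<String, List<String>> taskRacks = Map.of(
> > >>>>>>     "0_0", List.of("rack1", "rack1", "rack2"),
> > >>>>>>     "0_1", List.of("rack2", "rack2", "rack2"));
> > >>>>>> Map<String, String> clientRack =
> > >>>>>>     Map.of("c1_1", "rack1", "c2_1", "rack2");
> > >>>>>> for (var t : taskRacks.entrySet()) {
> > >>>>>>     for (var c : clientRack.entrySet()) {
> > >>>>>>         long cost = t.getValue().stream()
> > >>>>>>             .filter(r -> !r.equals(c.getValue())).count();
> > >>>>>>         System.out.printf("%s -> %s cost %d%n",
> > >>>>>>             t.getKey(), c.getKey(), cost);
> > >>>>>>     }
> > >>>>>> }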
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Hao
> > >>>>>>
> > >>>>>> On Tue, May 9, 2023 at 5:55 PM Colt McNealy <c...@littlehorse.io>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Hello Hao,
> > >>>>>>>
> > >>>>>>> First of all, THANK YOU for putting this together. I had been
> > >>>>>>> hoping someone might bring something like this forward. A few
> > >>>>>>> comments:
> > >>>>>>>
> > >>>>>>> **1: Runtime Complexity
> > >>>>>>>> Klein’s cycle canceling algorithm can solve the min-cost flow
> > >>>>>>>> problem in O(E^2CU) time where C is max cost and U is max
> > >>>>>>>> capacity. In our particular case, C is 1 and U is at most 3 (A
> > >>>>>>>> task can have at most 3 topics including changelog topic?). So
> > >>>>>>>> the algorithm runs in O(E^2) time for our case.
> > >>>>>>>
> > >>>>>>> A Task can have multiple input topics, and also multiple state
> > >>>>>>> stores, and multiple output topics. The most common case is
> > >>>>>>> three topics as you described, but this is not necessarily
> > >>>>>>> guaranteed. Also, math is one of my weak points, but to me
> > >>>>>>> O(E^2) is equivalent to O(1), and I struggle to see how the
> > >>>>>>> algorithm isn't at least O(N) where N is the number of Tasks...?
> > >>>>>>>
> > >>>>>>> **2: Broker-Side Partition Assignments
> > >>>>>>> Consider the case with just three topics in a Task (one input,
> > >>>>>>> one output, one changelog). If all three partition leaders are
> > >>>>>>> in the same Rack (or better yet, the same broker), then we could
> > >>>>>>> get massive savings by assigning the Task to that
> > >>>>>>> Rack/availability zone. But if the leaders for each partition
> > >>>>>>> are spread across multiple zones, how will you handle that? Is
> > >>>>>>> that outside the scope of this KIP, or is it worth introducing a
> > >>>>>>> kafka-streams-generate-rebalance-proposal.sh tool?
> > >>>>>>>
> > >>>>>>> Colt McNealy
> > >>>>>>> *Founder, LittleHorse.io*
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, May 9, 2023 at 4:03 PM Hao Li <h...@confluent.io.invalid>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hi all,
> > >>>>>>>>
> > >>>>>>>> I have submitted KIP-925 to add rack awareness logic in task
> > >>>>>>>> assignment in Kafka Streams and would like to start a
> > >>>>>>>> discussion:
> > >>>>>>>>
> > >>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-925%3A+Rack+aware+task+assignment+in+Kafka+Streams
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> Thanks,
> > >>>>>>>> Hao
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> Thanks,
> > >>>>>> Hao
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Thanks,
> > >>>> Hao
> > >>>>
> > >>>
> > >>
> > >>
> > >> --
> > >> Thanks,
> > >> Hao
> > >>
> > >
> > >
> >
>
>
> --
> Thanks,
> Hao
>
