Last week I was able to spend a bit of time working on KIP-236 again and,
based on the discussion about that with Jun back in December, I refactored
the controller to store the reassignment state in /brokers/topics/${topic}
instead of introducing new ZK nodes. This morning I was wondering what to
do as a next step, as these changes are more or less useless on their own,
without APIs for discovering the current reassignments and/or reassigning
partitions. I started thinking again about this KIP, and realised that
using an internal compacted topic (say __partition_reassignments), as
suggested by Steven and Colin, would require changes in basically the same
places.

Thinking through some of the failure modes ("what if I update ZK, but can't
produce to the topic?") I realised that it would actually be possible to
simply stop storing this info in ZK entirely and just store this state in
the __partition_reassignments topic. Doing it that way would eliminate
those failure modes, and would allow clients interested in reassignment
completion to consume from this topic and respond to records published
with a null value (indicating completion of a reassignment).
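
To make the client side concrete, here's a sketch of what a watcher might
look like (the topic name follows the suggestion above, but the choice of
a string key identifying the topic-partition is my assumption, not
anything agreed):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    // Sketch: watch __partition_reassignments and treat a null value
    // (a compaction tombstone) as "this partition's reassignment is done".
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker:9092");
    props.put("group.id", "reassignment-watcher");
    props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer");
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        consumer.subscribe(Collections.singletonList("__partition_reassignments"));
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(1000L)) {
                if (record.value() == null) {
                    System.out.println("Reassignment complete: " + record.key());
                }
            }
        }
    }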

There are some interesting implications to doing this:

1. The __partition_reassignments topic would need to be replicated in
order for reassignment to remain available (if the leader of a partition
of __partition_reassignments were unavailable, then the partitions whose
reassignment state is held by that partition could not be reassigned).
2. We would want to avoid unclean leader election for this topic (see the
sketch after this list).
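
To make those requirements concrete, here's a sketch of the configuration
such a topic would need. The partition count and replication factor are
illustrative assumptions, and the broker would presumably create the topic
internally rather than via the AdminClient; this just shows the settings,
given an AdminClient instance adminClient:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.admin.NewTopic;

    // Compacted (so the latest state per key is retained), well
    // replicated, and never subject to unclean leader election.
    Map<String, String> configs = new HashMap<>();
    configs.put("cleanup.policy", "compact");
    configs.put("unclean.leader.election.enable", "false");
    configs.put("min.insync.replicas", "2");
    NewTopic topic = new NewTopic("__partition_reassignments", 50, (short) 3)
            .configs(configs);
    adminClient.createTopics(Collections.singleton(topic)).all().get();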

But I'd be interested to hear what other people think about this approach.

Cheers,

Tom


On 9 January 2018 at 21:18, Colin McCabe <cmcc...@apache.org> wrote:

> What if we had an internal topic which watchers could listen to for
> information about partition reassignments?  The information could be in
> JSON, so if we want to add new fields later, we always could.
>
> This avoids introducing a new AdminClient API.  For clients that want to
> be notified about partition reassignments in a timely fashion, this avoids
> the "polling an AdminClient API in a tight loop" antipattern.  It allows
> watchers to be notified in a simple and natural way about what is going
> on.  Access can be controlled by the existing topic ACL mechanisms.
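>
> For instance (purely illustrative; none of these field names are
> proposed anywhere), a record value might look something like:
>
>     {"topic": "my-topic", "partition": 0,
>      "currentReplicas": [1, 2, 3], "targetReplicas": [1, 2, 4]}
>
> and new fields could be added later without breaking existing watchers.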
>
> best,
> Colin
>
>
> On Fri, Dec 22, 2017, at 06:48, Tom Bentley wrote:
> > Hi Steven,
> >
> > I must admit that I didn't really consider that option. I can see how
> > attractive it is from your perspective. In practice it would come with
> > lots of edge cases which would need to be thought through:
> >
> > 1. What happens if the controller can't produce a record to this topic
> > because the partition's leader is unavailable?
> > 2. One solution to that is for the topic to be replicated on every
> > broker, so that the controller could elect itself leader on controller
> > failover.
> > But that raises another problem: What if, upon controller failover, the
> > controller is ineligible for leader election because it's not in the ISR?
> > 3. The above questions suggest the controller might not always be able to
> > produce to the topic, but the controller isn't able to control when other
> > brokers catch up replicating moved partitions and has to deal with those
> > events. The controller would have to record (in memory) that the
> > reassignment was complete but not yet published, and publish it later,
> > when it was able to.
> > 4. Further to 3, we would need to recover the in-memory state of
> > reassignments on controller failover. But now we have to consider what
> > happens if the controller cannot *consume* from the topic.
> >
> > This seems pretty complicated to me. I think each of the above points has
> > alternatives (or compromises) which might make the problem more
> > tractable,
> > so I'd welcome hearing from anyone who has ideas on that. In particular
> > there are parallels with consumer offsets which might be worth thinking
> > about some more.
> >
> > It would be useful to define better the use case we're trying to cater
> > to here.
> >
> > * Is it just a notification that a given reassignment has finished that
> > you're interested in?
> > * What are the consequences if such a notification is delayed, or dropped
> > entirely?
> >
> > Regards,
> >
> > Tom
> >
> >
> >
> > On 19 December 2017 at 20:34, Steven Aerts <steven.ae...@gmail.com>
> > wrote:
> >
> > > Hello Tom,
> > >
> > >
> > > When you were working out KIP-236, did you consider migrating the
> > > reassignment state from ZooKeeper to an internal Kafka topic, keyed
> > > by partition and log compacted?
> > >
> > > It would allow an admin client and controller to easily subscribe to
> > > those changes, without the need to extend the network protocol as
> > > discussed in KIP-240.
> > >
> > > This is just a theoretical idea I wanted to share, as I can't find a
> > > reason why it would be a stupid idea. But I assume that in practice,
> > > this would imply too much change to the code base to be viable.
> > >
> > >
> > > Regards,
> > >
> > >
> > >    Steven
> > >
> > >
> > > On 18 December 2017 at 11:49, Tom Bentley <t.j.bent...@gmail.com>
> > > wrote:
> > > > Hi Steven,
> > > >
> > > >> I think it would be useful to be able to subscribe yourself on
> > > >> updates of reassignment changes.
> > > >
> > > > I agree this would be really useful, but, to the extent I
> > > > understand the networking underpinnings of the admin client, it
> > > > might be difficult to do well in practice. Part of the problem is
> > > > that you might "set a watch" (to borrow the ZK terminology) via one
> > > > broker (or the controller), only for that broker to fail (or the
> > > > controller be re-elected). Obviously you can detect the loss of
> > > > connection and set a new watch via a different broker (or the new
> > > > controller), but that couldn't be transparent to the user, because
> > > > the AdminClient doesn't know what changed while it was
> > > > disconnected/not watching.
> > > >
> > > > Another issue is that to avoid races you really need to combine
> > > > fetching the current state with setting the watch (as is done in
> > > > the native ZooKeeper API). I think there are lots of subtle issues
> > > > of this sort which would need to be addressed to make something
> > > > reliable.
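> > > >
> > > > For illustration, with the plain ZooKeeper client the read and the
> > > > watch are a single call, which is what closes the race (the path
> > > > is just an example, given a connected ZooKeeper client zk):
> > > >
> > > >     byte[] state = zk.getData("/admin/reassignments/my-topic/0",
> > > >         event -> { /* re-read and re-register the watch */ },
> > > >         null);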
> > > >
> > > > In the mean time, ZooKeeper already has a (proven and mature) API
> > > > for watches, so there is, in principle, a good workaround. I say
> > > > "in principle" because in the KIP-236 proposal right now the
> > > > /admin/reassign_partitions znode is legacy and the reassignment is
> > > > represented by /admin/reassignments/$topic/$partition. That naming
> > > > scheme for the znode would make it harder for ZooKeeper clients
> > > > like yours, because such clients would need to set a child watch
> > > > per topic. The original proposal for the naming scheme was
> > > > /admin/reassignments/$topic-$partition, which would mean clients
> > > > like yours would need only one child watch. The advantage of
> > > > /admin/reassignments/$topic/$partition is that it scales better. I
> > > > don't currently know how well ZooKeeper copes with nodes with many
> > > > children, so it's difficult for me to weigh those two options, but
> > > > I would be happy to switch back to
> > > > /admin/reassignments/$topic-$partition if we could reassure
> > > > ourselves it would scale OK to the reassignment sizes people would
> > > > need in practice.
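> > > >
> > > > (Concretely, with the flat scheme a client like yours would need
> > > > only a single child watch, along the lines of:
> > > >
> > > >     List<String> inFlight = zk.getChildren("/admin/reassignments",
> > > >         event -> { /* re-list and re-watch */ });
> > > >
> > > > whereas the per-topic scheme needs one such watch per topic with an
> > > > in-flight reassignment. Sketch only; paths as proposed above.)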
> > > >
> > > > Overall I would prefer not to tackle something like this in *this*
> > > > KIP, though it could be something for a future KIP. Of course I'm
> > > > happy to hear more discussion about this too!
> > > >
> > > > Cheers,
> > > >
> > > > Tom
> > > >
> > > >
> > > > On 15 December 2017 at 18:51, Steven Aerts <steven.ae...@gmail.com>
> > > > wrote:
> > > >
> > > >> Tom,
> > > >>
> > > >>
> > > >> I think it would be useful to be able to subscribe yourself on
> > > >> updates of reassignment changes. Our internal Kafka supervisor and
> > > >> monitoring tools are currently subscribed to these changes in
> > > >> ZooKeeper so they can babysit our clusters.
> > > >>
> > > >> I think it would be nice if we could receive these events through
> > > >> the AdminClient. In the API proposal, you can only poll for
> > > >> changes.
> > > >>
> > > >> No clue how difficult it would be to implement; maybe you can
> > > >> piggyback on some version number in the repartition messages or on
> > > >> ZooKeeper.
> > > >>
> > > >> This is just an idea, not a must-have feature for me. We can
> > > >> always poll via the proposed API.
> > > >>
> > > >>
> > > >> Regards,
> > > >>
> > > >>
> > > >>    Steven
> > > >>
> > > >>
> > > >> On Fri, 15 Dec 2017 at 19:16, Tom Bentley <t.j.bent...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Hi,
> > > >> >
> > > >> > KIP-236 lays the foundations for AdminClient APIs to do with
> > > >> > partition reassignment. I'd now like to start discussing
> > > >> > KIP-240, which adds APIs to the AdminClient to list and describe
> > > >> > the current reassignments.
> > > >> >
> > > >> >
> > > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-240%3A+AdminClient.listReassignments+AdminClient.describeReassignments
> > > >> >
> > > >> > Aside: I have fairly developed ideas for the API for starting a
> > > >> > reassignment, but I intend to put that in a third KIP.
> > > >> >
> > > >> > Cheers,
> > > >> >
> > > >> > Tom
> > > >> >
> > > >>
> > >
>
