Hey Jun,

Certainly. We can discuss later after KIP-320 settles.
Thanks! Dong On Wed, Jul 11, 2018 at 8:54 AM, Jun Rao <j...@confluent.io> wrote: > Hi, Dong, > > Sorry for the late response. Since KIP-320 is covering some of the similar > problems described in this KIP, perhaps we can wait until KIP-320 settles > and see what's still left uncovered in this KIP. > > Thanks, > > Jun > > On Mon, Jun 4, 2018 at 7:03 PM, Dong Lin <lindon...@gmail.com> wrote: > > > Hey Jun, > > > > It seems that we have made considerable progress on the discussion of > > KIP-253 since February. Do you think we should continue the discussion > > there, or can we continue the voting for this KIP? I am happy to submit > the > > PR and move forward the progress for this KIP. > > > > Thanks! > > Dong > > > > > > On Wed, Feb 7, 2018 at 11:42 PM, Dong Lin <lindon...@gmail.com> wrote: > > > > > Hey Jun, > > > > > > Sure, I will come up with a KIP this week. I think there is a way to > > allow > > > partition expansion to arbitrary number without introducing new > concepts > > > such as read-only partition or repartition epoch. > > > > > > Thanks, > > > Dong > > > > > > On Wed, Feb 7, 2018 at 5:28 PM, Jun Rao <j...@confluent.io> wrote: > > > > > >> Hi, Dong, > > >> > > >> Thanks for the reply. The general idea that you had for adding > > partitions > > >> is similar to what we had in mind. It would be useful to make this > more > > >> general, allowing adding an arbitrary number of partitions (instead of > > >> just > > >> doubling) and potentially removing partitions as well. The following > is > > >> the > > >> high level idea from the discussion with Colin, Jason and Ismael. > > >> > > >> * To change the number of partitions from X to Y in a topic, the > > >> controller > > >> marks all existing X partitions as read-only and creates Y new > > partitions. > > >> The new partitions are writable and are tagged with a higher > repartition > > >> epoch (RE). > > >> > > >> * The controller propagates the new metadata to every broker. Once the > > >> leader of a partition is marked as read-only, it rejects the produce > > >> requests on this partition. The producer will then refresh the > metadata > > >> and > > >> start publishing to the new writable partitions. > > >> > > >> * The consumers will then be consuming messages in RE order. The > > consumer > > >> coordinator will only assign partitions in the same RE to consumers. > > Only > > >> after all messages in an RE are consumed, will partitions in a higher > RE > > >> be > > >> assigned to consumers. > > >> > > >> As Colin mentioned, if we do the above, we could potentially (1) use a > > >> globally unique partition id, or (2) use a globally unique topic id to > > >> distinguish recreated partitions due to topic deletion. > > >> > > >> So, perhaps we can sketch out the re-partitioning KIP a bit more and > see > > >> if > > >> there is any overlap with KIP-232. Would you be interested in doing > > that? > > >> If not, we can do that next week. > > >> > > >> Jun > > >> > > >> > > >> On Tue, Feb 6, 2018 at 11:30 AM, Dong Lin <lindon...@gmail.com> > wrote: > > >> > > >> > Hey Jun, > > >> > > > >> > Interestingly I am also planning to sketch a KIP to allow partition > > >> > expansion for keyed topics after this KIP. Since you are already > doing > > >> > that, I guess I will just share my high level idea here in case it > is > > >> > helpful. > > >> > > > >> > The motivation for the KIP is that we currently lose order guarantee > > for > > >> > messages with the same key if we expand partitions of keyed topic. 
> > >> > > > >> > The solution can probably be built upon the following ideas: > > >> > > > >> > - Partition number of the keyed topic should always be doubled (or > > >> > multiplied by power of 2). Given that we select a partition based on > > >> > hash(key) % partitionNum, this should help us ensure that, a message > > >> > assigned to an existing partition will not be mapped to another > > existing > > >> > partition after partition expansion. > > >> > > > >> > - Producer includes in the ProduceRequest some information that > helps > > >> > ensure that messages produced ti a partition will monotonically > > >> increase in > > >> > the partitionNum of the topic. In other words, if broker receives a > > >> > ProduceRequest and notices that the producer does not know the > > partition > > >> > number has increased, broker should reject this request. That > > >> "information" > > >> > maybe leaderEpoch, max partitionEpoch of the partitions of the > topic, > > or > > >> > simply partitionNum of the topic. The benefit of this property is > that > > >> we > > >> > can keep the new logic for in-order message consumption entirely in > > how > > >> > consumer leader determines the partition -> consumer mapping. > > >> > > > >> > - When consumer leader determines partition -> consumer mapping, > > leader > > >> > first reads the start position for each partition using > > >> OffsetFetchRequest. > > >> > If start position are all non-zero, then assignment can be done in > its > > >> > current manner. The assumption is that, a message in the new > partition > > >> > should only be consumed after all messages with the same key > produced > > >> > before it has been consumed. Since some messages in the new > partition > > >> has > > >> > been consumed, we should not worry about consuming messages > > >> out-of-order. > > >> > This benefit of this approach is that we can avoid unnecessary > > overhead > > >> in > > >> > the common case. > > >> > > > >> > - If the consumer leader finds that the start position for some > > >> partition > > >> > is 0. Say the current partition number is 18 and the partition index > > is > > >> 12, > > >> > then consumer leader should ensure that messages produced to > partition > > >> 12 - > > >> > 18/2 = 3 before the first message of partition 12 is consumed, > before > > it > > >> > assigned partition 12 to any consumer in the consumer group. Since > we > > >> have > > >> > a "information" that is monotonically increasing per partition, > > consumer > > >> > can read the value of this information from the first message in > > >> partition > > >> > 12, get the offset corresponding to this value in partition 3, > assign > > >> > partition except for partition 12 (and probably other new > partitions) > > to > > >> > the existing consumers, waiting for the committed offset to go > beyond > > >> this > > >> > offset for partition 3, and trigger rebalance again so that > partition > > 3 > > >> can > > >> > be reassigned to some consumer. > > >> > > > >> > > > >> > Thanks, > > >> > Dong > > >> > > > >> > > > >> > On Tue, Feb 6, 2018 at 10:10 AM, Jun Rao <j...@confluent.io> wrote: > > >> > > > >> > > Hi, Dong, > > >> > > > > >> > > Thanks for the KIP. It looks good overall. We are working on a > > >> separate > > >> > KIP > > >> > > for adding partitions while preserving the ordering guarantees. > That > > >> may > > >> > > require another flavor of partition epoch. 
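A minimal sketch of the doubling property described above, assuming the usual hash(key) % partitionCount assignment (the class and variable names are illustrative, not Kafka code): when the partition count doubles from N to 2N, a key that hashed to partition p can only land in p or p + N, so the "parent" of a new partition is its index minus the old count, e.g. partition 12 of a topic grown from 9 to 18 partitions descends from partition 12 - 9 = 3.

import java.util.stream.IntStream;

// Illustrative only: checks the doubling property for hash(key) % partitionCount.
public class DoublingCheck {

    static int partitionFor(int keyHash, int partitionCount) {
        return (keyHash & 0x7fffffff) % partitionCount;   // mask to non-negative
    }

    public static void main(String[] args) {
        int oldCount = 9;              // e.g. the topic had 9 partitions
        int newCount = 2 * oldCount;   // doubled to 18

        IntStream.range(0, 1_000_000).forEach(keyHash -> {
            int oldPartition = partitionFor(keyHash, oldCount);
            int newPartition = partitionFor(keyHash, newCount);
            // A key stays in its old partition or moves to oldPartition + oldCount;
            // it never lands in a *different* pre-existing partition.
            if (newPartition != oldPartition && newPartition != oldPartition + oldCount) {
                throw new AssertionError("doubling property violated for hash " + keyHash);
            }
        });

        // The "parent" of new partition 12 in an 18-partition topic is 12 - 9 = 3.
        System.out.println("parent of partition 12 is " + (12 - oldCount));
    }
}

This is why the proposal restricts expansion to powers of two: with an arbitrary new count, keys can move between pre-existing partitions and the per-key ordering argument no longer holds.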
It's not very clear > > whether > > >> > that > > >> > > partition epoch can be merged with the partition epoch in this > KIP. > > >> So, > > >> > > perhaps you can wait on this a bit until we post the other KIP in > > the > > >> > next > > >> > > few days. > > >> > > > > >> > > Jun > > >> > > > > >> > > > > >> > > > > >> > > On Mon, Feb 5, 2018 at 2:43 PM, Becket Qin <becket....@gmail.com> > > >> wrote: > > >> > > > > >> > > > +1 on the KIP. > > >> > > > > > >> > > > I think the KIP is mainly about adding the capability of > tracking > > >> the > > >> > > > system state change lineage. It does not seem necessary to > bundle > > >> this > > >> > > KIP > > >> > > > with replacing the topic partition with partition epoch in > > >> > produce/fetch. > > >> > > > Replacing topic-partition string with partition epoch is > > >> essentially a > > >> > > > performance improvement on top of this KIP. That can probably be > > >> done > > >> > > > separately. > > >> > > > > > >> > > > Thanks, > > >> > > > > > >> > > > Jiangjie (Becket) Qin > > >> > > > > > >> > > > On Mon, Jan 29, 2018 at 11:52 AM, Dong Lin <lindon...@gmail.com > > > > >> > wrote: > > >> > > > > > >> > > > > Hey Colin, > > >> > > > > > > >> > > > > On Mon, Jan 29, 2018 at 11:23 AM, Colin McCabe < > > >> cmcc...@apache.org> > > >> > > > wrote: > > >> > > > > > > >> > > > > > > On Mon, Jan 29, 2018 at 10:35 AM, Dong Lin < > > >> lindon...@gmail.com> > > >> > > > > wrote: > > >> > > > > > > > > >> > > > > > > > Hey Colin, > > >> > > > > > > > > > >> > > > > > > > I understand that the KIP will adds overhead by > > introducing > > >> > > > > > per-partition > > >> > > > > > > > partitionEpoch. I am open to alternative solutions that > > does > > >> > not > > >> > > > > incur > > >> > > > > > > > additional overhead. But I don't see a better way now. > > >> > > > > > > > > > >> > > > > > > > IMO the overhead in the FetchResponse may not be that > > much. > > >> We > > >> > > > > probably > > >> > > > > > > > should discuss the percentage increase rather than the > > >> absolute > > >> > > > > number > > >> > > > > > > > increase. Currently after KIP-227, per-partition header > > has > > >> 23 > > >> > > > bytes. > > >> > > > > > This > > >> > > > > > > > KIP adds another 4 bytes. Assume the records size is > 10KB, > > >> the > > >> > > > > > percentage > > >> > > > > > > > increase is 4 / (23 + 10000) = 0.03%. It seems > negligible, > > >> > right? > > >> > > > > > > > >> > > > > > Hi Dong, > > >> > > > > > > > >> > > > > > Thanks for the response. I agree that the FetchRequest / > > >> > > FetchResponse > > >> > > > > > overhead should be OK, now that we have incremental fetch > > >> requests > > >> > > and > > >> > > > > > responses. However, there are a lot of cases where the > > >> percentage > > >> > > > > increase > > >> > > > > > is much greater. 
For example, if a client is doing full > > >> > > > > MetadataRequests / > > >> > > > > > Responses, we have some math kind of like this per > partition: > > >> > > > > > > > >> > > > > > > UpdateMetadataRequestPartitionState => topic partition > > >> > > > > controller_epoch > > >> > > > > > leader leader_epoch partition_epoch isr zk_version replicas > > >> > > > > > offline_replicas > > >> > > > > > > 14 bytes: topic => string (assuming about 10 byte topic > > >> names) > > >> > > > > > > 4 bytes: partition => int32 > > >> > > > > > > 4 bytes: conroller_epoch => int32 > > >> > > > > > > 4 bytes: leader => int32 > > >> > > > > > > 4 bytes: leader_epoch => int32 > > >> > > > > > > +4 EXTRA bytes: partition_epoch => int32 <-- NEW > > >> > > > > > > 2+4+4+4 bytes: isr => [int32] (assuming 3 in the ISR) > > >> > > > > > > 4 bytes: zk_version => int32 > > >> > > > > > > 2+4+4+4 bytes: replicas => [int32] (assuming 3 replicas) > > >> > > > > > > 2 offline_replicas => [int32] (assuming no offline > > replicas) > > >> > > > > > > > >> > > > > > Assuming I added that up correctly, the per-partition > overhead > > >> goes > > >> > > > from > > >> > > > > > 64 bytes per partition to 68, a 6.2% increase. > > >> > > > > > > > >> > > > > > We could do similar math for a lot of the other RPCs. And > you > > >> will > > >> > > > have > > >> > > > > a > > >> > > > > > similar memory and garbage collection impact on the brokers > > >> since > > >> > you > > >> > > > > have > > >> > > > > > to store all this extra state as well. > > >> > > > > > > > >> > > > > > > >> > > > > That is correct. IMO the Metadata is only updated periodically > > >> and is > > >> > > > > probably not a big deal if we increase it by 6%. The > > FetchResponse > > >> > and > > >> > > > > ProduceRequest are probably the only requests that are bounded > > by > > >> the > > >> > > > > bandwidth throughput. > > >> > > > > > > >> > > > > > > >> > > > > > > > >> > > > > > > > > > >> > > > > > > > I agree that we can probably save more space by using > > >> partition > > >> > > ID > > >> > > > so > > >> > > > > > that > > >> > > > > > > > we no longer needs the string topic name. The similar > idea > > >> has > > >> > > also > > >> > > > > > been > > >> > > > > > > > put in the Rejected Alternative section in KIP-227. > While > > >> this > > >> > > idea > > >> > > > > is > > >> > > > > > > > promising, it seems orthogonal to the goal of this KIP. > > >> Given > > >> > > that > > >> > > > > > there is > > >> > > > > > > > already many work to do in this KIP, maybe we can do the > > >> > > partition > > >> > > > ID > > >> > > > > > in a > > >> > > > > > > > separate KIP? > > >> > > > > > > > >> > > > > > I guess my thinking is that the goal here is to replace an > > >> > identifier > > >> > > > > > which can be re-used (the tuple of topic name, partition ID) > > >> with > > >> > an > > >> > > > > > identifier that cannot be re-used (the tuple of topic name, > > >> > partition > > >> > > > ID, > > >> > > > > > partition epoch) in order to gain better semantics. As long > > as > > >> we > > >> > > are > > >> > > > > > replacing the identifier, why not replace it with an > > identifier > > >> > that > > >> > > > has > > >> > > > > > important performance advantages? The KIP freeze for the > next > > >> > > release > > >> > > > > has > > >> > > > > > already passed, so there is time to do this. 
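The per-partition arithmetic quoted above is easy to reproduce. A back-of-the-envelope sketch using the same assumptions from the message (roughly 10-byte topic names, 3 replicas, 3 brokers in the ISR, no offline replicas); the field sizes are the quoted estimates, not a wire-format encoder:

// Reproduces the per-partition size estimate discussed above (not a wire encoder).
public class PartitionStateSizeEstimate {
    public static void main(String[] args) {
        int topic = 14;             // string, assuming ~10-byte topic names
        int partition = 4;          // int32
        int controllerEpoch = 4;    // int32
        int leader = 4;             // int32
        int leaderEpoch = 4;        // int32
        int partitionEpoch = 4;     // int32 -- the field added by this KIP
        int isr = 2 + 3 * 4;        // array header + 3 replica ids
        int zkVersion = 4;          // int32
        int replicas = 2 + 3 * 4;   // array header + 3 replica ids
        int offlineReplicas = 2;    // empty array

        int before = topic + partition + controllerEpoch + leader + leaderEpoch
                + isr + zkVersion + replicas + offlineReplicas;   // 64 bytes
        int after = before + partitionEpoch;                      // 68 bytes
        System.out.printf("%d -> %d bytes per partition (+%.2f%%)%n",
                before, after, 100.0 * partitionEpoch / before);  // +6.25%
    }
}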
> > >> > > > > > > > >> > > > > > > >> > > > > In general it can be easier for discussion and implementation > if > > >> we > > >> > can > > >> > > > > split a larger task into smaller and independent tasks. For > > >> example, > > >> > > > > KIP-112 and KIP-113 both deals with the JBOD support. KIP-31, > > >> KIP-32 > > >> > > and > > >> > > > > KIP-33 are about timestamp support. The option on this can be > > >> subject > > >> > > > > though. > > >> > > > > > > >> > > > > IMO the change to switch from (topic, partition ID) to > > >> partitionEpch > > >> > in > > >> > > > all > > >> > > > > request/response requires us to going through all request one > by > > >> one. > > >> > > It > > >> > > > > may not be hard but it can be time consuming and tedious. At > > high > > >> > level > > >> > > > the > > >> > > > > goal and the change for that will be orthogonal to the changes > > >> > required > > >> > > > in > > >> > > > > this KIP. That is the main reason I think we can split them > into > > >> two > > >> > > > KIPs. > > >> > > > > > > >> > > > > > > >> > > > > > On Mon, Jan 29, 2018, at 10:54, Dong Lin wrote: > > >> > > > > > > I think it is possible to move to entirely use > > partitionEpoch > > >> > > instead > > >> > > > > of > > >> > > > > > > (topic, partition) to identify a partition. Client can > > obtain > > >> the > > >> > > > > > > partitionEpoch -> (topic, partition) mapping from > > >> > MetadataResponse. > > >> > > > We > > >> > > > > > > probably need to figure out a way to assign partitionEpoch > > to > > >> > > > existing > > >> > > > > > > partitions in the cluster. But this should be doable. > > >> > > > > > > > > >> > > > > > > This is a good idea. I think it will save us some space in > > the > > >> > > > > > > request/response. The actual space saving in percentage > > >> probably > > >> > > > > depends > > >> > > > > > on > > >> > > > > > > the amount of data and the number of partitions of the > same > > >> > topic. > > >> > > I > > >> > > > > just > > >> > > > > > > think we can do it in a separate KIP. > > >> > > > > > > > >> > > > > > Hmm. How much extra work would be required? It seems like > we > > >> are > > >> > > > > already > > >> > > > > > changing almost every RPC that involves topics and > partitions, > > >> > > already > > >> > > > > > adding new per-partition state to ZooKeeper, already > changing > > >> how > > >> > > > clients > > >> > > > > > interact with partitions. Is there some other big piece of > > work > > >> > we'd > > >> > > > > have > > >> > > > > > to do to move to partition IDs that we wouldn't need for > > >> partition > > >> > > > > epochs? > > >> > > > > > I guess we'd have to find a way to support regular > > >> expression-based > > >> > > > topic > > >> > > > > > subscriptions. If we split this into multiple KIPs, > wouldn't > > we > > >> > end > > >> > > up > > >> > > > > > changing all that RPCs and ZK state a second time? Also, > I'm > > >> > curious > > >> > > > if > > >> > > > > > anyone has done any proof of concept GC, memory, and network > > >> usage > > >> > > > > > measurements on switching topic names for topic IDs. > > >> > > > > > > > >> > > > > > > >> > > > > > > >> > > > > We will need to go over all requests/responses to check how to > > >> > replace > > >> > > > > (topic, partition ID) with partition epoch. It requires > > >> non-trivial > > >> > > work > > >> > > > > and could take time. 
As you mentioned, we may want to see how > > much > > >> > > saving > > >> > > > > we can get by switching from topic names to partition epoch. > > That > > >> > > itself > > >> > > > > requires time and experiment. It seems that the new idea does > > not > > >> > > > rollback > > >> > > > > any change proposed in this KIP. So I am not sure we can get > > much > > >> by > > >> > > > > putting them into the same KIP. > > >> > > > > > > >> > > > > Anyway, if more people are interested in seeing the new idea > in > > >> the > > >> > > same > > >> > > > > KIP, I can try that. > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > > >> > > > > > best, > > >> > > > > > Colin > > >> > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > On Mon, Jan 29, 2018 at 10:18 AM, Colin McCabe < > > >> > > cmcc...@apache.org > > >> > > > > > > >> > > > > > wrote: > > >> > > > > > > > > > >> > > > > > > >> On Fri, Jan 26, 2018, at 12:17, Dong Lin wrote: > > >> > > > > > > >> > Hey Colin, > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > On Fri, Jan 26, 2018 at 10:16 AM, Colin McCabe < > > >> > > > > cmcc...@apache.org> > > >> > > > > > > >> wrote: > > >> > > > > > > >> > > > >> > > > > > > >> > > On Thu, Jan 25, 2018, at 16:47, Dong Lin wrote: > > >> > > > > > > >> > > > Hey Colin, > > >> > > > > > > >> > > > > > >> > > > > > > >> > > > Thanks for the comment. > > >> > > > > > > >> > > > > > >> > > > > > > >> > > > On Thu, Jan 25, 2018 at 4:15 PM, Colin McCabe < > > >> > > > > > cmcc...@apache.org> > > >> > > > > > > >> > > wrote: > > >> > > > > > > >> > > > > > >> > > > > > > >> > > > > On Wed, Jan 24, 2018, at 21:07, Dong Lin wrote: > > >> > > > > > > >> > > > > > Hey Colin, > > >> > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > Thanks for reviewing the KIP. > > >> > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > If I understand you right, you maybe > suggesting > > >> that > > >> > > we > > >> > > > > can > > >> > > > > > use > > >> > > > > > > >> a > > >> > > > > > > >> > > global > > >> > > > > > > >> > > > > > metadataEpoch that is incremented every time > > >> > > controller > > >> > > > > > updates > > >> > > > > > > >> > > metadata. > > >> > > > > > > >> > > > > > The problem with this solution is that, if a > > >> topic > > >> > is > > >> > > > > > deleted > > >> > > > > > > >> and > > >> > > > > > > >> > > created > > >> > > > > > > >> > > > > > again, user will not know whether that the > > offset > > >> > > which > > >> > > > is > > >> > > > > > > >> stored > > >> > > > > > > >> > > before > > >> > > > > > > >> > > > > > the topic deletion is no longer valid. This > > >> > motivates > > >> > > > the > > >> > > > > > idea > > >> > > > > > > >> to > > >> > > > > > > >> > > include > > >> > > > > > > >> > > > > > per-partition partitionEpoch. Does this sound > > >> > > > reasonable? > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > Hi Dong, > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > Perhaps we can store the last valid offset of > > each > > >> > > deleted > > >> > > > > > topic > > >> > > > > > > >> in > > >> > > > > > > >> > > > > ZooKeeper. 
Then, when a topic with one of > those > > >> names > > >> > > > gets > > >> > > > > > > >> > > re-created, we > > >> > > > > > > >> > > > > can start the topic at the previous end offset > > >> rather > > >> > > than > > >> > > > > at > > >> > > > > > 0. > > >> > > > > > > >> This > > >> > > > > > > >> > > > > preserves immutability. It is no more > burdensome > > >> than > > >> > > > > having > > >> > > > > > to > > >> > > > > > > >> > > preserve a > > >> > > > > > > >> > > > > "last epoch" for the deleted partition > somewhere, > > >> > right? > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > >> > > > > > > >> > > > My concern with this solution is that the number > of > > >> > > > zookeeper > > >> > > > > > nodes > > >> > > > > > > >> get > > >> > > > > > > >> > > > more and more over time if some users keep > deleting > > >> and > > >> > > > > creating > > >> > > > > > > >> topics. > > >> > > > > > > >> > > Do > > >> > > > > > > >> > > > you think this can be a problem? > > >> > > > > > > >> > > > > >> > > > > > > >> > > Hi Dong, > > >> > > > > > > >> > > > > >> > > > > > > >> > > We could expire the "partition tombstones" after an > > >> hour > > >> > or > > >> > > > so. > > >> > > > > > In > > >> > > > > > > >> > > practice this would solve the issue for clients > that > > >> like > > >> > to > > >> > > > > > destroy > > >> > > > > > > >> and > > >> > > > > > > >> > > re-create topics all the time. In any case, > doesn't > > >> the > > >> > > > current > > >> > > > > > > >> proposal > > >> > > > > > > >> > > add per-partition znodes as well that we have to > > track > > >> > even > > >> > > > > after > > >> > > > > > the > > >> > > > > > > >> > > partition is deleted? Or did I misunderstand that? > > >> > > > > > > >> > > > > >> > > > > > > >> > > > >> > > > > > > >> > Actually the current KIP does not add per-partition > > >> znodes. > > >> > > > Could > > >> > > > > > you > > >> > > > > > > >> > double check? I can fix the KIP wiki if there is > > anything > > >> > > > > > misleading. > > >> > > > > > > >> > > >> > > > > > > >> Hi Dong, > > >> > > > > > > >> > > >> > > > > > > >> I double-checked the KIP, and I can see that you are in > > >> fact > > >> > > > using a > > >> > > > > > > >> global counter for initializing partition epochs. So, > > you > > >> are > > >> > > > > > correct, it > > >> > > > > > > >> doesn't add per-partition znodes for partitions that no > > >> longer > > >> > > > > exist. > > >> > > > > > > >> > > >> > > > > > > >> > > > >> > > > > > > >> > If we expire the "partition tomstones" after an hour, > > and > > >> > the > > >> > > > > topic > > >> > > > > > is > > >> > > > > > > >> > re-created after more than an hour since the topic > > >> deletion, > > >> > > > then > > >> > > > > > we are > > >> > > > > > > >> > back to the situation where user can not tell whether > > the > > >> > > topic > > >> > > > > has > > >> > > > > > been > > >> > > > > > > >> > re-created or not, right? > > >> > > > > > > >> > > >> > > > > > > >> Yes, with an expiration period, it would not ensure > > >> > > immutability-- > > >> > > > > you > > >> > > > > > > >> could effectively reuse partition names and they would > > look > > >> > the > > >> > > > > same. > > >> > > > > > > >> > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > > > >> > > > > > > >> > > It's not really clear to me what should happen > when a > > >> > topic > > >> > > is > > >> > > > > > > >> destroyed > > >> > > > > > > >> > > and re-created with new data. 
Should consumers > > >> continue > > >> > to > > >> > > be > > >> > > > > > able to > > >> > > > > > > >> > > consume? We don't know where they stopped > consuming > > >> from > > >> > > the > > >> > > > > > previous > > >> > > > > > > >> > > incarnation of the topic, so messages may have been > > >> lost. > > >> > > > > > Certainly > > >> > > > > > > >> > > consuming data from offset X of the new incarnation > > of > > >> the > > >> > > > topic > > >> > > > > > may > > >> > > > > > > >> give > > >> > > > > > > >> > > something totally different from what you would > have > > >> > gotten > > >> > > > from > > >> > > > > > > >> offset X > > >> > > > > > > >> > > of the previous incarnation of the topic. > > >> > > > > > > >> > > > > >> > > > > > > >> > > > >> > > > > > > >> > With the current KIP, if a consumer consumes a topic > > >> based > > >> > on > > >> > > > the > > >> > > > > > last > > >> > > > > > > >> > remembered (offset, partitionEpoch, leaderEpoch), and > > if > > >> the > > >> > > > topic > > >> > > > > > is > > >> > > > > > > >> > re-created, consume will throw > > >> > InvalidPartitionEpochException > > >> > > > > > because > > >> > > > > > > >> the > > >> > > > > > > >> > previous partitionEpoch will be different from the > > >> current > > >> > > > > > > >> partitionEpoch. > > >> > > > > > > >> > This is described in the Proposed Changes -> > > Consumption > > >> > after > > >> > > > > topic > > >> > > > > > > >> > deletion in the KIP. I can improve the KIP if there > is > > >> > > anything > > >> > > > > not > > >> > > > > > > >> clear. > > >> > > > > > > >> > > >> > > > > > > >> Thanks for the clarification. It sounds like what you > > >> really > > >> > > want > > >> > > > > is > > >> > > > > > > >> immutability-- i.e., to never "really" reuse partition > > >> > > > identifiers. > > >> > > > > > And > > >> > > > > > > >> you do this by making the partition name no longer the > > >> "real" > > >> > > > > > identifier. > > >> > > > > > > >> > > >> > > > > > > >> My big concern about this KIP is that it seems like an > > >> > > > > > anti-scalability > > >> > > > > > > >> feature. Now we are adding 4 extra bytes for every > > >> partition > > >> > in > > >> > > > the > > >> > > > > > > >> FetchResponse and Request, for example. That could be > 40 > > >> kb > > >> > per > > >> > > > > > request, > > >> > > > > > > >> if the user has 10,000 partitions. And of course, the > > KIP > > >> > also > > >> > > > > makes > > >> > > > > > > >> massive changes to UpdateMetadataRequest, > > MetadataResponse, > > >> > > > > > > >> OffsetCommitRequest, OffsetFetchResponse, > > >> LeaderAndIsrRequest, > > >> > > > > > > >> ListOffsetResponse, etc. which will also increase their > > >> size > > >> > on > > >> > > > the > > >> > > > > > wire > > >> > > > > > > >> and in memory. > > >> > > > > > > >> > > >> > > > > > > >> One thing that we talked a lot about in the past is > > >> replacing > > >> > > > > > partition > > >> > > > > > > >> names with IDs. IDs have a lot of really nice > features. > > >> They > > >> > > > take > > >> > > > > > up much > > >> > > > > > > >> less space in memory than strings (especially 2-byte > Java > > >> > > > strings). > > >> > > > > > They > > >> > > > > > > >> can often be allocated on the stack rather than the > heap > > >> > > > (important > > >> > > > > > when > > >> > > > > > > >> you are dealing with hundreds of thousands of them). > > They > > >> can > > >> > > be > > >> > > > > > > >> efficiently deserialized and serialized. 
If we use > > 64-bit > > >> > ones, > > >> > > > we > > >> > > > > > will > > >> > > > > > > >> never run out of IDs, which means that they can always > be > > >> > unique > > >> > > > per > > >> > > > > > > >> partition. > > >> > > > > > > >> > > >> > > > > > > >> Given that the partition name is no longer the "real" > > >> > identifier > > >> > > > for > > >> > > > > > > >> partitions in the current KIP-232 proposal, why not > just > > >> move > > >> > to > > >> > > > > using > > >> > > > > > > >> partition IDs entirely instead of strings? You have to > > >> change > > >> > > all > > >> > > > > the > > >> > > > > > > >> messages anyway. There isn't much point any more to > > >> carrying > > >> > > > around > > >> > > > > > the > > >> > > > > > > >> partition name in every RPC, since you really need > (name, > > >> > epoch) > > >> > > > to > > >> > > > > > > >> identify the partition. > > >> > > > > > > >> Probably the metadata response and a few other messages > > >> would > > >> > > have > > >> > > > > to > > >> > > > > > > >> still carry the partition name, to allow clients to go > > from > > >> > name > > >> > > > to > > >> > > > > > id. > > >> > > > > > > >> But we could mostly forget about the strings. And then > > >> this > > >> > > would > > >> > > > > be > > >> > > > > > a > > >> > > > > > > >> scalability improvement rather than a scalability > > problem. > > >> > > > > > > >> > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > By choosing to reuse the same (topic, partition, > > >> offset) > > >> > > > > 3-tuple, > > >> > > > > > we > > >> > > > > > > >> have > > >> > > > > > > >> > > > >> > > > > > > >> > chosen to give up immutability. That was a really > bad > > >> > > decision. > > >> > > > > > And > > >> > > > > > > >> now > > >> > > > > > > >> > > we have to worry about time dependencies, stale > > cached > > >> > data, > > >> > > > and > > >> > > > > > all > > >> > > > > > > >> the > > >> > > > > > > >> > > rest. We can't completely fix this inside Kafka no > > >> matter > > >> > > > what > > >> > > > > > we do, > > >> > > > > > > >> > > because not all that cached data is inside Kafka > > >> itself. > > >> > > Some > > >> > > > > of > > >> > > > > > it > > >> > > > > > > >> may be > > >> > > > > > > >> > > in systems that Kafka has sent data to, such as > other > > >> > > daemons, > > >> > > > > SQL > > >> > > > > > > >> > > databases, streams, and so forth. > > >> > > > > > > >> > > > > >> > > > > > > >> > > > >> > > > > > > >> > The current KIP will uniquely identify a message > using > > >> > (topic, > > >> > > > > > > >> partition, > > >> > > > > > > >> > offset, partitionEpoch) 4-tuple. This addresses the > > >> message > > >> > > > > > immutability > > >> > > > > > > >> > issue that you mentioned. Is there any corner case > > where > > >> the > > >> > > > > message > > >> > > > > > > >> > immutability is still not preserved with the current > > KIP? > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > > > >> > > > > > > >> > > I guess the idea here is that mirror maker should > > work > > >> as > > >> > > > > expected > > >> > > > > > > >> when > > >> > > > > > > >> > > users destroy a topic and re-create it with the > same > > >> name. > > >> > > > > That's > > >> > > > > > > >> kind of > > >> > > > > > > >> > > tough, though, since in that scenario, mirror maker > > >> > probably > > >> > > > > > should > > >> > > > > > > >> destroy > > >> > > > > > > >> > > and re-create the topic on the other end, too, > right? 
> > >> > > > > Otherwise, > > >> > > > > > > >> what you > > >> > > > > > > >> > > end up with on the other end could be half of one > > >> > > incarnation > > >> > > > of > > >> > > > > > the > > >> > > > > > > >> topic, > > >> > > > > > > >> > > and half of another. > > >> > > > > > > >> > > > > >> > > > > > > >> > > What mirror maker really needs is to be able to > > follow > > >> a > > >> > > > stream > > >> > > > > of > > >> > > > > > > >> events > > >> > > > > > > >> > > about the kafka cluster itself. We could have some > > >> master > > >> > > > topic > > >> > > > > > > >> which is > > >> > > > > > > >> > > always present and which contains data about all > > topic > > >> > > > > deletions, > > >> > > > > > > >> > > creations, etc. Then MM can simply follow this > topic > > >> and > > >> > do > > >> > > > > what > > >> > > > > > is > > >> > > > > > > >> needed. > > >> > > > > > > >> > > > > >> > > > > > > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > Then the next question maybe, should we use a > > >> global > > >> > > > > > > >> metadataEpoch + > > >> > > > > > > >> > > > > > per-partition partitionEpoch, instead of > using > > >> > > > > per-partition > > >> > > > > > > >> > > leaderEpoch > > >> > > > > > > >> > > > > + > > >> > > > > > > >> > > > > > per-partition leaderEpoch. The former > solution > > >> using > > >> > > > > > > >> metadataEpoch > > >> > > > > > > >> > > would > > >> > > > > > > >> > > > > > not work due to the following scenario > > (provided > > >> by > > >> > > > Jun): > > >> > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > "Consider the following scenario. In metadata > > v1, > > >> > the > > >> > > > > leader > > >> > > > > > > >> for a > > >> > > > > > > >> > > > > > partition is at broker 1. In metadata v2, > > leader > > >> is > > >> > at > > >> > > > > > broker > > >> > > > > > > >> 2. In > > >> > > > > > > >> > > > > > metadata v3, leader is at broker 1 again. The > > >> last > > >> > > > > committed > > >> > > > > > > >> offset > > >> > > > > > > >> > > in > > >> > > > > > > >> > > > > v1, > > >> > > > > > > >> > > > > > v2 and v3 are 10, 20 and 30, respectively. A > > >> > consumer > > >> > > is > > >> > > > > > > >> started and > > >> > > > > > > >> > > > > reads > > >> > > > > > > >> > > > > > metadata v1 and reads messages from offset 0 > to > > >> 25 > > >> > > from > > >> > > > > > broker > > >> > > > > > > >> 1. My > > >> > > > > > > >> > > > > > understanding is that in the current > proposal, > > >> the > > >> > > > > metadata > > >> > > > > > > >> version > > >> > > > > > > >> > > > > > associated with offset 25 is v1. The consumer > > is > > >> > then > > >> > > > > > restarted > > >> > > > > > > >> and > > >> > > > > > > >> > > > > fetches > > >> > > > > > > >> > > > > > metadata v2. The consumer tries to read from > > >> broker > > >> > 2, > > >> > > > > > which is > > >> > > > > > > >> the > > >> > > > > > > >> > > old > > >> > > > > > > >> > > > > > leader with the last offset at 20. In this > > case, > > >> the > > >> > > > > > consumer > > >> > > > > > > >> will > > >> > > > > > > >> > > still > > >> > > > > > > >> > > > > > get OffsetOutOfRangeException incorrectly." > > >> > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > Regarding your comment "For the second > purpose, > > >> this > > >> > > is > > >> > > > > > "soft > > >> > > > > > > >> state" > > >> > > > > > > >> > > > > > anyway. 
If the client thinks X is the leader > > >> but Y > > >> > is > > >> > > > > > really > > >> > > > > > > >> the > > >> > > > > > > >> > > leader, > > >> > > > > > > >> > > > > > the client will talk to X, and X will point > out > > >> its > > >> > > > > mistake > > >> > > > > > by > > >> > > > > > > >> > > sending > > >> > > > > > > >> > > > > back > > >> > > > > > > >> > > > > > a NOT_LEADER_FOR_PARTITION.", it is probably > no > > >> > true. > > >> > > > The > > >> > > > > > > >> problem > > >> > > > > > > >> > > here is > > >> > > > > > > >> > > > > > that the old leader X may still think it is > the > > >> > leader > > >> > > > of > > >> > > > > > the > > >> > > > > > > >> > > partition > > >> > > > > > > >> > > > > and > > >> > > > > > > >> > > > > > thus it will not send back > > >> NOT_LEADER_FOR_PARTITION. > > >> > > The > > >> > > > > > reason > > >> > > > > > > >> is > > >> > > > > > > >> > > > > provided > > >> > > > > > > >> > > > > > in KAFKA-6262. Can you check if that makes > > sense? > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > This is solvable with a timeout, right? If the > > >> leader > > >> > > > can't > > >> > > > > > > >> > > communicate > > >> > > > > > > >> > > > > with the controller for a certain period of > time, > > >> it > > >> > > > should > > >> > > > > > stop > > >> > > > > > > >> > > acting as > > >> > > > > > > >> > > > > the leader. We have to solve this problem, > > >> anyway, in > > >> > > > order > > >> > > > > > to > > >> > > > > > > >> fix > > >> > > > > > > >> > > all the > > >> > > > > > > >> > > > > corner cases. > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > >> > > > > > > >> > > > Not sure if I fully understand your proposal. The > > >> > proposal > > >> > > > > > seems to > > >> > > > > > > >> > > require > > >> > > > > > > >> > > > non-trivial changes to our existing leadership > > >> election > > >> > > > > > mechanism. > > >> > > > > > > >> Could > > >> > > > > > > >> > > > you provide more detail regarding how it works? > For > > >> > > example, > > >> > > > > how > > >> > > > > > > >> should > > >> > > > > > > >> > > > user choose this timeout, how leader determines > > >> whether > > >> > it > > >> > > > can > > >> > > > > > still > > >> > > > > > > >> > > > communicate with controller, and how this > triggers > > >> > > > controller > > >> > > > > to > > >> > > > > > > >> elect > > >> > > > > > > >> > > new > > >> > > > > > > >> > > > leader? > > >> > > > > > > >> > > > > >> > > > > > > >> > > Before I come up with any proposal, let me make > sure > > I > > >> > > > > understand > > >> > > > > > the > > >> > > > > > > >> > > problem correctly. My big question was, what > > prevents > > >> > > > > split-brain > > >> > > > > > > >> here? > > >> > > > > > > >> > > > > >> > > > > > > >> > > Let's say I have a partition which is on nodes A, > B, > > >> and > > >> > C, > > >> > > > with > > >> > > > > > > >> min-ISR > > >> > > > > > > >> > > 2. The controller is D. At some point, there is a > > >> > network > > >> > > > > > partition > > >> > > > > > > >> > > between A and B and the rest of the cluster. The > > >> > Controller > > >> > > > > > > >> re-assigns the > > >> > > > > > > >> > > partition to nodes C, D, and E. But A and B keep > > >> chugging > > >> > > > away, > > >> > > > > > even > > >> > > > > > > >> > > though they can no longer communicate with the > > >> controller. 
> > >> > > > > > > >> > > > > >> > > > > > > >> > > At some point, a client with stale metadata writes > to > > >> the > > >> > > > > > partition. > > >> > > > > > > >> It > > >> > > > > > > >> > > still thinks the partition is on node A, B, and C, > so > > >> > that's > > >> > > > > > where it > > >> > > > > > > >> sends > > >> > > > > > > >> > > the data. It's unable to talk to C, but A and B > > reply > > >> > back > > >> > > > that > > >> > > > > > all > > >> > > > > > > >> is > > >> > > > > > > >> > > well. > > >> > > > > > > >> > > > > >> > > > > > > >> > > Is this not a case where we could lose data due to > > >> split > > >> > > > brain? > > >> > > > > > Or is > > >> > > > > > > >> > > there a mechanism for preventing this that I > missed? > > >> If > > >> > it > > >> > > > is, > > >> > > > > it > > >> > > > > > > >> seems > > >> > > > > > > >> > > like a pretty serious failure case that we should > be > > >> > > handling > > >> > > > > > with our > > >> > > > > > > >> > > metadata rework. And I think epoch numbers and > > >> timeouts > > >> > > might > > >> > > > > be > > >> > > > > > > >> part of > > >> > > > > > > >> > > the solution. > > >> > > > > > > >> > > > > >> > > > > > > >> > > > >> > > > > > > >> > Right, split brain can happen if RF=4 and minIsr=2. > > >> > However, I > > >> > > > am > > >> > > > > > not > > >> > > > > > > >> sure > > >> > > > > > > >> > it is a pretty serious issue which we need to address > > >> today. > > >> > > > This > > >> > > > > > can be > > >> > > > > > > >> > prevented by configuring the Kafka topic so that > > minIsr > > > >> > > RF/2. > > >> > > > > > > >> Actually, > > >> > > > > > > >> > if user sets minIsr=2, is there anything reason that > > user > > >> > > wants > > >> > > > to > > >> > > > > > set > > >> > > > > > > >> RF=4 > > >> > > > > > > >> > instead of 4? > > >> > > > > > > >> > > > >> > > > > > > >> > Introducing timeout in leader election mechanism is > > >> > > > non-trivial. I > > >> > > > > > > >> think we > > >> > > > > > > >> > probably want to do that only if there is good > use-case > > >> that > > >> > > can > > >> > > > > not > > >> > > > > > > >> > otherwise be addressed with the current mechanism. > > >> > > > > > > >> > > >> > > > > > > >> I still would like to think about these corner cases > > more. > > >> > But > > >> > > > > > perhaps > > >> > > > > > > >> it's not directly related to this KIP. > > >> > > > > > > >> > > >> > > > > > > >> regards, > > >> > > > > > > >> Colin > > >> > > > > > > >> > > >> > > > > > > >> > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > best, > > >> > > > > > > >> > > Colin > > >> > > > > > > >> > > > > >> > > > > > > >> > > > > >> > > > > > > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > > >> > > > > best, > > >> > > > > > > >> > > > > Colin > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > Regards, > > >> > > > > > > >> > > > > > Dong > > >> > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > On Wed, Jan 24, 2018 at 10:39 AM, Colin > McCabe > > < > > >> > > > > > > >> cmcc...@apache.org> > > >> > > > > > > >> > > > > wrote: > > >> > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > > Hi Dong, > > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > Thanks for proposing this KIP. I think a > > >> metadata > > >> > > > epoch > > >> > > > > > is a > > >> > > > > > > >> > > really > > >> > > > > > > >> > > > > good > > >> > > > > > > >> > > > > > > idea. 
> > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > I read through the DISCUSS thread, but I > > still > > >> > don't > > >> > > > > have > > >> > > > > > a > > >> > > > > > > >> clear > > >> > > > > > > >> > > > > picture > > >> > > > > > > >> > > > > > > of why the proposal uses a metadata epoch > per > > >> > > > partition > > >> > > > > > rather > > >> > > > > > > >> > > than a > > >> > > > > > > >> > > > > > > global metadata epoch. A metadata epoch > per > > >> > > partition > > >> > > > > is > > >> > > > > > > >> kind of > > >> > > > > > > >> > > > > > > unpleasant-- it's at least 4 extra bytes > per > > >> > > partition > > >> > > > > > that we > > >> > > > > > > >> > > have to > > >> > > > > > > >> > > > > send > > >> > > > > > > >> > > > > > > over the wire in every full metadata > request, > > >> > which > > >> > > > > could > > >> > > > > > > >> become > > >> > > > > > > >> > > extra > > >> > > > > > > >> > > > > > > kilobytes on the wire when the number of > > >> > partitions > > >> > > > > > becomes > > >> > > > > > > >> large. > > >> > > > > > > >> > > > > Plus, > > >> > > > > > > >> > > > > > > we have to update all the auxillary classes > > to > > >> > > include > > >> > > > > an > > >> > > > > > > >> epoch. > > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > We need to have a global metadata epoch > > anyway > > >> to > > >> > > > handle > > >> > > > > > > >> partition > > >> > > > > > > >> > > > > > > addition and deletion. For example, if I > > give > > >> you > > >> > > > > > > >> > > > > > > MetadataResponse{part1,epoch 1, part2, > epoch > > 1} > > >> > and > > >> > > > > > {part1, > > >> > > > > > > >> > > epoch1}, > > >> > > > > > > >> > > > > which > > >> > > > > > > >> > > > > > > MetadataResponse is newer? You have no way > > of > > >> > > > knowing. > > >> > > > > > It > > >> > > > > > > >> could > > >> > > > > > > >> > > be > > >> > > > > > > >> > > > > that > > >> > > > > > > >> > > > > > > part2 has just been created, and the > response > > >> > with 2 > > >> > > > > > > >> partitions is > > >> > > > > > > >> > > > > newer. > > >> > > > > > > >> > > > > > > Or it coudl be that part2 has just been > > >> deleted, > > >> > and > > >> > > > > > > >> therefore the > > >> > > > > > > >> > > > > response > > >> > > > > > > >> > > > > > > with 1 partition is newer. You must have a > > >> global > > >> > > > epoch > > >> > > > > > to > > >> > > > > > > >> > > > > disambiguate > > >> > > > > > > >> > > > > > > these two cases. > > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > Previously, I worked on the Ceph > distributed > > >> > > > filesystem. > > >> > > > > > > >> Ceph had > > >> > > > > > > >> > > the > > >> > > > > > > >> > > > > > > concept of a map of the whole cluster, > > >> maintained > > >> > > by a > > >> > > > > few > > >> > > > > > > >> servers > > >> > > > > > > >> > > > > doing > > >> > > > > > > >> > > > > > > paxos. This map was versioned by a single > > >> 64-bit > > >> > > > epoch > > >> > > > > > number > > >> > > > > > > >> > > which > > >> > > > > > > >> > > > > > > increased on every change. It was > propagated > > >> to > > >> > > > clients > > >> > > > > > > >> through > > >> > > > > > > >> > > > > gossip. I > > >> > > > > > > >> > > > > > > wonder if something similar could work > here? 
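A small sketch of the ambiguity argument above, assuming a hypothetical response-level epoch carried alongside the per-partition epochs (the Snapshot type and field names are made up for illustration): two metadata snapshots that differ only in partition membership cannot be ordered by per-partition epochs alone, but a single response-level epoch gives a total order.

import java.util.Map;

// Illustrative only: per-partition epochs alone cannot say whether a partition
// was just created or just deleted; a response-level (global) epoch can.
public class MetadataSnapshotOrdering {

    record Snapshot(long globalEpoch, Map<String, Integer> partitionEpochs) {}

    static Snapshot newer(Snapshot a, Snapshot b) {
        // {t-0: e1, t-1: e1} vs. {t-0: e1} is ambiguous without globalEpoch.
        return a.globalEpoch() >= b.globalEpoch() ? a : b;
    }

    public static void main(String[] args) {
        Snapshot twoPartitions = new Snapshot(7, Map.of("t-0", 1, "t-1", 1));
        Snapshot onePartition = new Snapshot(8, Map.of("t-0", 1));
        // The global epoch says the one-partition snapshot is newer, i.e. t-1 was
        // deleted; with epoch 6 instead of 8 it would mean t-1 was just created.
        System.out.println("newer snapshot has "
                + newer(twoPartitions, onePartition).partitionEpochs().size() + " partition(s)");
    }
}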
> > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > It seems like the the Kafka > MetadataResponse > > >> > serves > > >> > > > two > > >> > > > > > > >> somewhat > > >> > > > > > > >> > > > > unrelated > > >> > > > > > > >> > > > > > > purposes. Firstly, it lets clients know > what > > >> > > > partitions > > >> > > > > > > >> exist in > > >> > > > > > > >> > > the > > >> > > > > > > >> > > > > > > system and where they live. Secondly, it > > lets > > >> > > clients > > >> > > > > > know > > >> > > > > > > >> which > > >> > > > > > > >> > > nodes > > >> > > > > > > >> > > > > > > within the partition are in-sync (in the > ISR) > > >> and > > >> > > > which > > >> > > > > > node > > >> > > > > > > >> is the > > >> > > > > > > >> > > > > leader. > > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > The first purpose is what you really need a > > >> > metadata > > >> > > > > epoch > > >> > > > > > > >> for, I > > >> > > > > > > >> > > > > think. > > >> > > > > > > >> > > > > > > You want to know whether a partition exists > > or > > >> > not, > > >> > > or > > >> > > > > you > > >> > > > > > > >> want to > > >> > > > > > > >> > > know > > >> > > > > > > >> > > > > > > which nodes you should talk to in order to > > >> write > > >> > to > > >> > > a > > >> > > > > > given > > >> > > > > > > >> > > > > partition. A > > >> > > > > > > >> > > > > > > single metadata epoch for the whole > response > > >> > should > > >> > > be > > >> > > > > > > >> adequate > > >> > > > > > > >> > > here. > > >> > > > > > > >> > > > > We > > >> > > > > > > >> > > > > > > should not change the partition assignment > > >> without > > >> > > > going > > >> > > > > > > >> through > > >> > > > > > > >> > > > > zookeeper > > >> > > > > > > >> > > > > > > (or a similar system), and this inherently > > >> > > serializes > > >> > > > > > updates > > >> > > > > > > >> into > > >> > > > > > > >> > > a > > >> > > > > > > >> > > > > > > numbered stream. Brokers should also stop > > >> > > responding > > >> > > > to > > >> > > > > > > >> requests > > >> > > > > > > >> > > when > > >> > > > > > > >> > > > > they > > >> > > > > > > >> > > > > > > are unable to contact ZK for a certain time > > >> > period. > > >> > > > > This > > >> > > > > > > >> prevents > > >> > > > > > > >> > > the > > >> > > > > > > >> > > > > case > > >> > > > > > > >> > > > > > > where a given partition has been moved off > > some > > >> > set > > >> > > of > > >> > > > > > nodes, > > >> > > > > > > >> but a > > >> > > > > > > >> > > > > client > > >> > > > > > > >> > > > > > > still ends up talking to those nodes and > > >> writing > > >> > > data > > >> > > > > > there. > > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > For the second purpose, this is "soft > state" > > >> > anyway. > > >> > > > If > > >> > > > > > the > > >> > > > > > > >> client > > >> > > > > > > >> > > > > thinks > > >> > > > > > > >> > > > > > > X is the leader but Y is really the leader, > > the > > >> > > client > > >> > > > > > will > > >> > > > > > > >> talk > > >> > > > > > > >> > > to X, > > >> > > > > > > >> > > > > and > > >> > > > > > > >> > > > > > > X will point out its mistake by sending > back > > a > > >> > > > > > > >> > > > > NOT_LEADER_FOR_PARTITION. > > >> > > > > > > >> > > > > > > Then the client can update its metadata > again > > >> and > > >> > > find > > >> > > > > > the new > > >> > > > > > > >> > > leader, > > >> > > > > > > >> > > > > if > > >> > > > > > > >> > > > > > > there is one. 
There is no need for an > epoch > > to > > >> > > handle > > >> > > > > > this. > > >> > > > > > > >> > > > > Similarly, I > > >> > > > > > > >> > > > > > > can't think of a reason why changing the > > >> in-sync > > >> > > > replica > > >> > > > > > set > > >> > > > > > > >> needs > > >> > > > > > > >> > > to > > >> > > > > > > >> > > > > bump > > >> > > > > > > >> > > > > > > the epoch. > > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > best, > > >> > > > > > > >> > > > > > > Colin > > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > On Wed, Jan 24, 2018, at 09:45, Dong Lin > > wrote: > > >> > > > > > > >> > > > > > > > Thanks much for reviewing the KIP! > > >> > > > > > > >> > > > > > > > > > >> > > > > > > >> > > > > > > > Dong > > >> > > > > > > >> > > > > > > > > > >> > > > > > > >> > > > > > > > On Wed, Jan 24, 2018 at 7:10 AM, Guozhang > > >> Wang < > > >> > > > > > > >> > > wangg...@gmail.com> > > >> > > > > > > >> > > > > > > wrote: > > >> > > > > > > >> > > > > > > > > > >> > > > > > > >> > > > > > > > > Yeah that makes sense, again I'm just > > >> making > > >> > > sure > > >> > > > we > > >> > > > > > > >> understand > > >> > > > > > > >> > > > > all the > > >> > > > > > > >> > > > > > > > > scenarios and what to expect. > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > I agree that if, more generally > speaking, > > >> say > > >> > > > users > > >> > > > > > have > > >> > > > > > > >> only > > >> > > > > > > >> > > > > consumed > > >> > > > > > > >> > > > > > > to > > >> > > > > > > >> > > > > > > > > offset 8, and then call seek(16) to > > "jump" > > >> to > > >> > a > > >> > > > > > further > > >> > > > > > > >> > > position, > > >> > > > > > > >> > > > > then > > >> > > > > > > >> > > > > > > she > > >> > > > > > > >> > > > > > > > > needs to be aware that OORE maybe > thrown > > >> and > > >> > she > > >> > > > > > needs to > > >> > > > > > > >> > > handle > > >> > > > > > > >> > > > > it or > > >> > > > > > > >> > > > > > > rely > > >> > > > > > > >> > > > > > > > > on reset policy which should not > surprise > > >> her. > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > I'm +1 on the KIP. > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > Guozhang > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > On Wed, Jan 24, 2018 at 12:31 AM, Dong > > Lin > > >> < > > >> > > > > > > >> > > lindon...@gmail.com> > > >> > > > > > > >> > > > > > > wrote: > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > > Yes, in general we can not prevent > > >> > > > > > > >> OffsetOutOfRangeException > > >> > > > > > > >> > > if > > >> > > > > > > >> > > > > user > > >> > > > > > > >> > > > > > > > > seeks > > >> > > > > > > >> > > > > > > > > > to a wrong offset. The main goal is > to > > >> > prevent > > >> > > > > > > >> > > > > > > OffsetOutOfRangeException > > >> > > > > > > >> > > > > > > > > if > > >> > > > > > > >> > > > > > > > > > user has done things in the right > way, > > >> e.g. > > >> > > user > > >> > > > > > should > > >> > > > > > > >> know > > >> > > > > > > >> > > that > > >> > > > > > > >> > > > > > > there > > >> > > > > > > >> > > > > > > > > is > > >> > > > > > > >> > > > > > > > > > message with this offset. 
> > >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > For example, if user calls seek(..) > > right > > >> > > after > > >> > > > > > > >> > > construction, the > > >> > > > > > > >> > > > > > > only > > >> > > > > > > >> > > > > > > > > > reason I can think of is that user > > stores > > >> > > offset > > >> > > > > > > >> externally. > > >> > > > > > > >> > > In > > >> > > > > > > >> > > > > this > > >> > > > > > > >> > > > > > > > > case, > > >> > > > > > > >> > > > > > > > > > user currently needs to use the > offset > > >> which > > >> > > is > > >> > > > > > obtained > > >> > > > > > > >> > > using > > >> > > > > > > >> > > > > > > > > position(..) > > >> > > > > > > >> > > > > > > > > > from the last run. With this KIP, > user > > >> needs > > >> > > to > > >> > > > > get > > >> > > > > > the > > >> > > > > > > >> > > offset > > >> > > > > > > >> > > > > and > > >> > > > > > > >> > > > > > > the > > >> > > > > > > >> > > > > > > > > > offsetEpoch using > > >> > positionAndOffsetEpoch(...) > > >> > > > and > > >> > > > > > stores > > >> > > > > > > >> > > these > > >> > > > > > > >> > > > > > > > > information > > >> > > > > > > >> > > > > > > > > > externally. The next time user starts > > >> > > consumer, > > >> > > > > > he/she > > >> > > > > > > >> needs > > >> > > > > > > >> > > to > > >> > > > > > > >> > > > > call > > >> > > > > > > >> > > > > > > > > > seek(..., offset, offsetEpoch) right > > >> after > > >> > > > > > construction. > > >> > > > > > > >> > > Then KIP > > >> > > > > > > >> > > > > > > should > > >> > > > > > > >> > > > > > > > > be > > >> > > > > > > >> > > > > > > > > > able to ensure that we don't throw > > >> > > > > > > >> OffsetOutOfRangeException > > >> > > > > > > >> > > if > > >> > > > > > > >> > > > > > > there is > > >> > > > > > > >> > > > > > > > > no > > >> > > > > > > >> > > > > > > > > > unclean leader election. > > >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > Does this sound OK? > > >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > Regards, > > >> > > > > > > >> > > > > > > > > > Dong > > >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > On Tue, Jan 23, 2018 at 11:44 PM, > > >> Guozhang > > >> > > Wang > > >> > > > < > > >> > > > > > > >> > > > > wangg...@gmail.com> > > >> > > > > > > >> > > > > > > > > > wrote: > > >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > "If consumer wants to consume > message > > >> with > > >> > > > > offset > > >> > > > > > 16, > > >> > > > > > > >> then > > >> > > > > > > >> > > > > consumer > > >> > > > > > > >> > > > > > > > > must > > >> > > > > > > >> > > > > > > > > > > have > > >> > > > > > > >> > > > > > > > > > > already fetched message with offset > > 15" > > >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > --> this may not be always true > > right? > > >> > What > > >> > > if > > >> > > > > > > >> consumer > > >> > > > > > > >> > > just > > >> > > > > > > >> > > > > call > > >> > > > > > > >> > > > > > > > > > seek(16) > > >> > > > > > > >> > > > > > > > > > > after construction and then poll > > >> without > > >> > > > > committed > > >> > > > > > > >> offset > > >> > > > > > > >> > > ever > > >> > > > > > > >> > > > > > > stored > > >> > > > > > > >> > > > > > > > > > > before? 
Admittedly it is rare but > we > > do > > >> > not > > >> > > > > > > >> programmably > > >> > > > > > > >> > > > > disallow > > >> > > > > > > >> > > > > > > it. > > >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > Guozhang > > >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > On Tue, Jan 23, 2018 at 10:42 PM, > > Dong > > >> > Lin < > > >> > > > > > > >> > > > > lindon...@gmail.com> > > >> > > > > > > >> > > > > > > > > wrote: > > >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > Hey Guozhang, > > >> > > > > > > >> > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > Thanks much for reviewing the > KIP! > > >> > > > > > > >> > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > In the scenario you described, > > let's > > >> > > assume > > >> > > > > that > > >> > > > > > > >> broker > > >> > > > > > > >> > > A has > > >> > > > > > > >> > > > > > > > > messages > > >> > > > > > > >> > > > > > > > > > > with > > >> > > > > > > >> > > > > > > > > > > > offset up to 10, and broker B has > > >> > messages > > >> > > > > with > > >> > > > > > > >> offset > > >> > > > > > > >> > > up to > > >> > > > > > > >> > > > > 20. > > >> > > > > > > >> > > > > > > If > > >> > > > > > > >> > > > > > > > > > > > consumer wants to consume message > > >> with > > >> > > > offset > > >> > > > > > 9, it > > >> > > > > > > >> will > > >> > > > > > > >> > > not > > >> > > > > > > >> > > > > > > receive > > >> > > > > > > >> > > > > > > > > > > > OffsetOutOfRangeException > > >> > > > > > > >> > > > > > > > > > > > from broker A. > > >> > > > > > > >> > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > If consumer wants to consume > > message > > >> > with > > >> > > > > offset > > >> > > > > > > >> 16, then > > >> > > > > > > >> > > > > > > consumer > > >> > > > > > > >> > > > > > > > > must > > >> > > > > > > >> > > > > > > > > > > > have already fetched message with > > >> offset > > >> > > 15, > > >> > > > > > which > > >> > > > > > > >> can > > >> > > > > > > >> > > only > > >> > > > > > > >> > > > > come > > >> > > > > > > >> > > > > > > from > > >> > > > > > > >> > > > > > > > > > > > broker B. Because consumer will > > fetch > > >> > from > > >> > > > > > broker B > > >> > > > > > > >> only > > >> > > > > > > >> > > if > > >> > > > > > > >> > > > > > > > > leaderEpoch > > >> > > > > > > >> > > > > > > > > > > >= > > >> > > > > > > >> > > > > > > > > > > > 2, then the current consumer > > >> leaderEpoch > > >> > > can > > >> > > > > > not be > > >> > > > > > > >> 1 > > >> > > > > > > >> > > since > > >> > > > > > > >> > > > > this > > >> > > > > > > >> > > > > > > KIP > > >> > > > > > > >> > > > > > > > > > > > prevents leaderEpoch rewind. Thus > > we > > >> > will > > >> > > > not > > >> > > > > > have > > >> > > > > > > >> > > > > > > > > > > > OffsetOutOfRangeException > > >> > > > > > > >> > > > > > > > > > > > in this case. > > >> > > > > > > >> > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > Does this address your question, > or > > >> > maybe > > >> > > > > there > > >> > > > > > is > > >> > > > > > > >> more > > >> > > > > > > >> > > > > advanced > > >> > > > > > > >> > > > > > > > > > scenario > > >> > > > > > > >> > > > > > > > > > > > that the KIP does not handle? 
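A minimal sketch of the fencing logic implied in the reply above, assuming the consumer tracks the leaderEpoch of the last message it consumed for each partition, as the KIP proposes (class and method names are illustrative, not the actual client code):

// Illustrative only: metadata (or a broker) whose leaderEpoch is older than the
// epoch of the last consumed message must not be used for the next fetch.
public final class MetadataFreshnessCheck {

    static boolean usable(int lastConsumedLeaderEpoch, int metadataLeaderEpoch) {
        return metadataLeaderEpoch >= lastConsumedLeaderEpoch;
    }

    public static void main(String[] args) {
        int lastConsumedLeaderEpoch = 2;  // offset 15 was produced under epoch 2 (broker B)
        int staleMetadataEpoch = 1;       // outdated metadata still pointing at broker A
        int freshMetadataEpoch = 2;

        // With stale metadata rejected, the consumer keeps refreshing instead of
        // fetching offset 16 from the old leader and hitting OffsetOutOfRangeException.
        System.out.println("stale metadata usable? "
                + usable(lastConsumedLeaderEpoch, staleMetadataEpoch));   // false
        System.out.println("fresh metadata usable? "
                + usable(lastConsumedLeaderEpoch, freshMetadataEpoch));   // true
    }
}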
> > >> > > > > > > >> > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > Thanks, > > >> > > > > > > >> > > > > > > > > > > > Dong > > >> > > > > > > >> > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > On Tue, Jan 23, 2018 at 9:43 PM, > > >> > Guozhang > > >> > > > > Wang < > > >> > > > > > > >> > > > > > > wangg...@gmail.com> > > >> > > > > > > >> > > > > > > > > > > wrote: > > >> > > > > > > >> > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > Thanks Dong, I made a pass over > > the > > >> > wiki > > >> > > > and > > >> > > > > > it > > >> > > > > > > >> lgtm. > > >> > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > Just a quick question: can we > > >> > completely > > >> > > > > > > >> eliminate the > > >> > > > > > > >> > > > > > > > > > > > > OffsetOutOfRangeException with > > this > > >> > > > > approach? > > >> > > > > > Say > > >> > > > > > > >> if > > >> > > > > > > >> > > there > > >> > > > > > > >> > > > > is > > >> > > > > > > >> > > > > > > > > > > consecutive > > >> > > > > > > >> > > > > > > > > > > > > leader changes such that the > > cached > > >> > > > > metadata's > > >> > > > > > > >> > > partition > > >> > > > > > > >> > > > > epoch > > >> > > > > > > >> > > > > > > is > > >> > > > > > > >> > > > > > > > > 1, > > >> > > > > > > >> > > > > > > > > > > and > > >> > > > > > > >> > > > > > > > > > > > > the metadata fetch response > > returns > > >> > > with > > >> > > > > > > >> partition > > >> > > > > > > >> > > epoch 2 > > >> > > > > > > >> > > > > > > > > pointing > > >> > > > > > > >> > > > > > > > > > to > > >> > > > > > > >> > > > > > > > > > > > > leader broker A, while the > actual > > >> > > > up-to-date > > >> > > > > > > >> metadata > > >> > > > > > > >> > > has > > >> > > > > > > >> > > > > > > partition > > >> > > > > > > >> > > > > > > > > > > > epoch 3 > > >> > > > > > > >> > > > > > > > > > > > > whose leader is now broker B, > the > > >> > > metadata > > >> > > > > > > >> refresh will > > >> > > > > > > >> > > > > still > > >> > > > > > > >> > > > > > > > > succeed > > >> > > > > > > >> > > > > > > > > > > and > > >> > > > > > > >> > > > > > > > > > > > > the follow-up fetch request may > > >> still > > >> > > see > > >> > > > > > OORE? 
> > >> > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > Guozhang > > >> > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > On Tue, Jan 23, 2018 at 3:47 > PM, > > >> Dong > > >> > > Lin > > >> > > > < > > >> > > > > > > >> > > > > lindon...@gmail.com > > >> > > > > > > >> > > > > > > > > > >> > > > > > > >> > > > > > > > > > wrote: > > >> > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > Hi all, > > >> > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > I would like to start the > > voting > > >> > > process > > >> > > > > for > > >> > > > > > > >> KIP-232: > > >> > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > https://cwiki.apache.org/ > > >> > > > > > > >> > > confluence/display/KAFKA/KIP- > > >> > > > > > > >> > > > > > > > > > > > > > > 232%3A+Detect+outdated+metadat > > >> > > > > > > >> a+using+leaderEpoch+ > > >> > > > > > > >> > > > > > > > > > and+partitionEpoch > > >> > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > The KIP will help fix a > > >> concurrency > > >> > > > issue > > >> > > > > in > > >> > > > > > > >> Kafka > > >> > > > > > > >> > > which > > >> > > > > > > >> > > > > > > > > currently > > >> > > > > > > >> > > > > > > > > > > can > > >> > > > > > > >> > > > > > > > > > > > > > cause message loss or message > > >> > > > duplication > > >> > > > > in > > >> > > > > > > >> > > consumer. > > >> > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > Regards, > > >> > > > > > > >> > > > > > > > > > > > > > Dong > > >> > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > -- > > >> > > > > > > >> > > > > > > > > > > > > -- Guozhang > > >> > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > -- > > >> > > > > > > >> > > > > > > > > > > -- Guozhang > > >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > -- > > >> > > > > > > >> > > > > > > > > -- Guozhang > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > >> > > > > > > >> > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > > > > > > > >