Hi Divij,

It seems there is some confusion about the new protocol for fetching from
the tiered offset.
The scenario you are highlighting is one where:
Leader's Log Start Offset = 0
Last Tiered Offset = 10

The following sequence of events will happen:

1. The follower requests offset 0 from the leader.
2. Assuming offset 0 is not available locally (to arrive at your scenario),
the leader returns an OffsetMovedToTieredStorageException.
3. The follower fetches the earliest-pending-upload offset and receives 11.
4. The follower builds aux state from [0-10] and sets the fetch offset to 11.
(This step corresponds to step 3 in your email.)

At this stage, even if the leader has uploaded more data and the
last-tiered offset has moved (say, to 15), it will not matter:
offset 11 should still be available on the leader, so when the
follower requests data with fetch offset 11, the leader will return
a valid partition data response that the follower can consume before
proceeding further. Even if offset 11 is no longer available, the
follower will eventually catch up with the leader by resetting its
fetch offset until it reaches an offset that is available in the
leader's local log. Once it catches up, replication on the follower
can proceed.
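
To make the retry behaviour concrete, here is a minimal sketch of the
follower-side loop (illustrative Java-flavoured pseudocode; leader.fetch(),
earliestPendingUploadOffset(), buildAuxState() and resetFetchOffset() are
hypothetical placeholders, not the actual replica fetcher code):

    long fetchOffset = 0L;                      // empty follower starts at 0
    while (true) {
        FetchResult result = leader.fetch(fetchOffset);
        if (result.error() == Errors.OFFSET_MOVED_TO_TIERED_STORAGE) {
            // The offset exists only in remote storage: ask the leader for
            // the earliest offset it has not yet tiered (11 in the example).
            OffsetAndEpoch pending = leader.earliestPendingUploadOffset();
            buildAuxState(fetchOffset, pending.offset() - 1); // aux state for [0-10]
            fetchOffset = pending.offset();                   // resume at 11
        } else if (result.error() == Errors.OFFSET_OUT_OF_RANGE) {
            // The fetch offset fell off the leader's local log in the
            // meantime; reset and retry until it lands inside the local log.
            fetchOffset = resetFetchOffset(leader);
        } else {
            appendToLocalLog(result.records());               // normal replication
            fetchOffset = result.nextFetchOffset();
        }
    }

Each iteration either replicates data or moves the fetch offset closer to
the leader's local log, so the follower cannot get stuck in the cycle you
describe.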

Regards,
Abhijeet.



On Tue, Jul 2, 2024 at 3:55 PM Divij Vaidya <divijvaidy...@gmail.com> wrote:

> Hi folks.
>
> I am late to the party but I have a question on the proposal.
>
> How are we preventing a situation such as the following:
>
> 1. Empty follower asks leader for 0
> 2. Leader compares 0 with last-tiered-offset, and responds with 11 (where 10
> is the last-tiered-offset) and an OffsetMovedToTieredStorageException
> 3. Follower builds aux state from [0-10] and sets the fetch offset to 11
> 4. But leader has already uploaded more data and now the new
> last-tiered-offset is 15
> 5. Go back to 2
>
> This could cause a cycle where the replica will be stuck trying to
> reconcile with the leader.
>
> --
> Divij Vaidya
>
>
>
> On Fri, Apr 26, 2024 at 7:28 AM Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
>
> > Thank you all for your comments. As all the comments in the thread are
> > addressed, I am starting a Vote thread for the KIP. Please have a look.
> >
> > Regards.
> > Abhijeet.
> >
> >
> >
> > On Thu, Apr 25, 2024 at 6:08 PM Luke Chen <show...@gmail.com> wrote:
> >
> > > Hi, Abhijeet,
> > >
> > > Thanks for the update.
> > >
> > > I have no more comments.
> > >
> > > Luke
> > >
> > > On Thu, Apr 25, 2024 at 4:21 AM Jun Rao <j...@confluent.io.invalid> wrote:
> > >
> > > > Hi, Abhijeet,
> > > >
> > > > Thanks for the updated KIP. It looks good to me.
> > > >
> > > > Jun
> > > >
> > > > On Mon, Apr 22, 2024 at 12:08 PM Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > Please find my comments inline.
> > > > >
> > > > >
> > > > > On Thu, Apr 18, 2024 at 3:26 AM Jun Rao <j...@confluent.io.invalid> wrote:
> > > > >
> > > > > > Hi, Abhijeet,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 1. I am wondering if we could achieve the same result by just
> > > > > > lowering local.retention.ms and local.retention.bytes. This also
> > > > > > allows the newly started follower to build up the local data before
> > > > > > serving the consumer traffic.
> > > > > >
> > > > >
> > > > > I am not sure I fully followed this. Do you mean we could lower the
> > > > > local.retention (by size and time) so that there is little data on
> > > > > the leader's local storage, allowing the follower to quickly catch up
> > > > > with the leader?
> > > > >
> > > > > In that case, we would need to set a small local retention across all
> > > > > brokers in the cluster. That has the undesired effect of increasing
> > > > > remote log fetches for serving consume requests, which can cause
> > > > > degradation. Also, this behaviour (of increased remote fetches) would
> > > > > occur on all brokers at all times, whereas in the KIP we restrict the
> > > > > behavior to newly bootstrapped brokers, and only until they have
> > > > > fully built the local logs as per the retention defined at the
> > > > > cluster level. (Deprioritization of the broker could help reduce the
> > > > > impact even further.)
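> > > > >
> > > > > For illustration, that alternative would amount to setting something
> > > > > like the following (example values only) on every topic:
> > > > >
> > > > >     local.retention.ms=3600000        # keep only ~1 hour of data locally
> > > > >     local.retention.bytes=1073741824  # cap local data at ~1 GiB
> > > > >
> > > > > so every broker would serve most historical reads from remote
> > > > > storage, not just the newly bootstrapped ones.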
> > > > >
> > > > >
> > > > > >
> > > > > > 2. Have you updated the KIP?
> > > > > >
> > > > >
> > > > > The KIP has been updated now.
> > > > >
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, Apr 9, 2024 at 3:36 AM Satish Duggana <satish.dugg...@gmail.com> wrote:
> > > > > >
> > > > > > > +1 to Jun for adding the consumer-fetching-from-a-follower
> > > > > > > scenario to the existing section that talks about the drawback
> > > > > > > when a node built with last-tiered-offset has become a leader. As
> > > > > > > Abhijeet mentioned, we plan to have a follow-up KIP that will
> > > > > > > address these by deprioritizing such brokers. The
> > > > > > > deprioritization of those brokers can be removed once they catch
> > > > > > > up to the local log retention.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Satish.
> > > > > > >
> > > > > > > On Tue, 9 Apr 2024 at 14:08, Luke Chen <show...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hi Abhijeet,
> > > > > > > >
> > > > > > > > Thanks for the KIP to improve the tiered storage feature!
> > > > > > > >
> > > > > > > > Questions:
> > > > > > > > 1. We could also get the "pending-upload-offset" and epoch via
> > > > > > > > remote log metadata, instead of adding a new API to fetch them
> > > > > > > > from the leader. Could you explain why you chose the latter
> > > > > > > > approach instead of the former?
> > > > > > > > 2.
> > > > > > > > > We plan to have a follow-up KIP that will address both the
> > > > > > > > > deprioritization of these brokers from leadership, as well as
> > > > > > > > > for consumption (when fetching from followers is allowed).
> > > > > > > >
> > > > > > > > I agree with Jun that we need to make clear all the possible
> > > > > > > > drawbacks. So, could we add the drawbacks that Jun mentioned
> > > > > > > > about the performance issue when consumers fetch from
> > > > > > > > followers?
> > > > > > > >
> > > > > > > > 3. Could we add a "Rejected Alternatives" section to the end of
> > > > > > > > the KIP to record some of them? Like the "ListOffsetRequest"
> > > > > > > > approach vs. the "Earliest-Pending-Upload-Offset" approach, or
> > > > > > > > getting the "Earliest-Pending-Upload-Offset" from remote log
> > > > > > > > metadata, etc.
> > > > > > > >
> > > > > > > > Thanks.
> > > > > > > > Luke
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Apr 9, 2024 at 2:25 PM Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi Christo,
> > > > > > > > >
> > > > > > > > > Please find my comments inline.
> > > > > > > > >
> > > > > > > > > On Fri, Apr 5, 2024 at 12:36 PM Christo Lolov <christolo...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hello Abhijeet and Jun,
> > > > > > > > > >
> > > > > > > > > > I have been mulling this KIP over a bit more in recent
> > days!
> > > > > > > > > >
> > > > > > > > > > re: Jun
> > > > > > > > > >
> > > > > > > > > > I wasn't aware we apply 2.1 and 2.2 for reserving new
> > > > > > > > > > timestamps - in retrospect it should have been fairly
> > > > > > > > > > obvious. I would need to go and update KIP-1005 myself then,
> > > > > > > > > > thank you for giving the useful reference!
> > > > > > > > > >
> > > > > > > > > > 4. I think Abhijeet wants to rebuild state from
> > > > > > latest-tiered-offset
> > > > > > > and
> > > > > > > > > > fetch from latest-tiered-offset + 1 only for new replicas
> > (or
> > > > > > > replicas
> > > > > > > > > > which experienced a disk failure) to decrease the time a
> > > > > partition
> > > > > > > spends
> > > > > > > > > > in under-replicated state. In other words, a follower
> which
> > > has
> > > > > > just
> > > > > > > > > fallen
> > > > > > > > > > out of ISR, but has local data will continue using
> today's
> > > > Tiered
> > > > > > > Storage
> > > > > > > > > > replication protocol (i.e. fetching from
> earliest-local). I
> > > > > further
> > > > > > > > > believe
> > > > > > > > > > he has taken this approach so that local state of
> replicas
> > > > which
> > > > > > have
> > > > > > > > > just
> > > > > > > > > > fallen out of ISR isn't forcefully wiped thus leading to
> > > > > situation
> > > > > > 1.
> > > > > > > > > > Abhijeet, have I understood (and summarised) what you are
> > > > > proposing
> > > > > > > > > > correctly?
> > > > > > > > > >
> > > > > > > > > Yes, your understanding is correct. We want to limit the
> > > > > > > > > behavior changes only to new replicas.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > 5. I think in today's Tiered Storage we know the leader's
> > > > > > > > > > log-start-offset from the FetchResponse, and we can learn
> > > > > > > > > > its local-log-start-offset from the ListOffsets by asking
> > > > > > > > > > for the earliest-local timestamp (-4). But granted, this
> > > > > > > > > > ought to be added as an additional API call in the KIP.
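> > > > > > > > > >
> > > > > > > > > > (Roughly, and assuming the AdminClient additions proposed
> > > > > > > > > > in KIP-1005 land as described, something like
> > > > > > > > > > admin.listOffsets(Map.of(tp, OffsetSpec.earliestLocal())),
> > > > > > > > > > where -4 is the sentinel timestamp sent on the wire.)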
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > Yes, I clarified this in my reply to Jun. I will add this
> > > missing
> > > > > > > detail in
> > > > > > > > > the KIP.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > re: Abhijeet
> > > > > > > > > >
> > > > > > > > > > 101. I am still a bit confused as to why you want to
> > include
> > > a
> > > > > new
> > > > > > > offset
> > > > > > > > > > (i.e. pending-upload-offset) when you yourself mention
> that
> > > you
> > > > > > > could use
> > > > > > > > > > an already existing offset (i.e. last-tiered-offset + 1).
> > In
> > > > > > > essence, you
> > > > > > > > > > end your Motivation with "In this KIP, we will focus only
> > on
> > > > the
> > > > > > > follower
> > > > > > > > > > fetch protocol using the *last-tiered-offset*" and then
> in
> > > the
> > > > > > > following
> > > > > > > > > > sections you talk about pending-upload-offset. I
> understand
> > > > this
> > > > > > > might be
> > > > > > > > > > classified as an implementation detail, but if you
> > introduce
> > > a
> > > > > new
> > > > > > > offset
> > > > > > > > > > (i.e. pending-upload-offset) you have to make a change to
> > the
> > > > > > > ListOffsets
> > > > > > > > > > API (i.e. introduce -6) and thus document it in this KIP
> as
> > > > such.
> > > > > > > > > However,
> > > > > > > > > > the last-tiered-offset ought to already be exposed as
> part
> > of
> > > > > > > KIP-1005
> > > > > > > > > > (under implementation). Am I misunderstanding something
> > here?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I have tried to clarify this in my reply to Jun.
> > > > > > > > >
> > > > > > > > > > The follower needs to build the local data starting from
> > > > > > > > > > the offset Earliest-Pending-Upload-Offset. Hence it needs
> > > > > > > > > > the offset and the corresponding leader-epoch.
> > > > > > > > > > There are two ways to do this:
> > > > > > > > > >    1. We add support in ListOffsetRequest to be able to
> > > > > > > > > > fetch this offset (and leader epoch) from the leader.
> > > > > > > > > >    2. Or, fetch the tiered-offset (which is already
> > > > > > > > > > supported). From this offset, we can get the
> > > > > > > > > > Earliest-Pending-Upload-Offset by just adding 1 to the
> > > > > > > > > > tiered-offset. However, we still need the leader epoch for
> > > > > > > > > > that offset, since there is no guarantee that the leader
> > > > > > > > > > epoch for Earliest-Pending-Upload-Offset will be the same as
> > > > > > > > > > the leader epoch for the tiered-offset. We may need another
> > > > > > > > > > API call to the leader for this.
> > > > > > > > > > I prefer the first approach. The only problem with the first
> > > > > > > > > > approach is that it introduces one more offset. The second
> > > > > > > > > > approach avoids this problem but is a little complicated.
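> > > > > > > > > >
> > > > > > > > > > As a rough sketch, with hypothetical helper names rather
> > > > > > > > > > than real client APIs, the second approach would be:
> > > > > > > > > >
> > > > > > > > > >     // Option 2: derive the offset from the already-exposed
> > > > > > > > > >     // tiered-offset, then resolve its epoch separately.
> > > > > > > > > >     long tiered = leader.latestTieredOffset();       // via ListOffsets
> > > > > > > > > >     long fetchOffset = tiered + 1;   // Earliest-Pending-Upload-Offset
> > > > > > > > > >     int epoch = leader.epochForOffset(fetchOffset);  // extra round trip
> > > > > > > > > >
> > > > > > > > > > whereas the first approach returns both the offset and its
> > > > > > > > > > epoch in a single ListOffsets call.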
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Christo
> > > > > > > > > >
> > > > > > > > > > On Thu, 4 Apr 2024 at 19:37, Jun Rao <j...@confluent.io.invalid> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi, Abhijeet,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the KIP. Left a few comments.
> > > > > > > > > > >
> > > > > > > > > > > 1. "A drawback of using the last-tiered-offset is that
> > this
> > > > new
> > > > > > > > > follower
> > > > > > > > > > > would possess only a limited number of locally stored
> > > > segments.
> > > > > > > Should
> > > > > > > > > it
> > > > > > > > > > > ascend to the role of leader, there is a risk of
> needing
> > to
> > > > > fetch
> > > > > > > these
> > > > > > > > > > > segments from the remote storage, potentially impacting
> > > > broker
> > > > > > > > > > > performance."
> > > > > > > > > > > Since we support consumers fetching from followers,
> this
> > > is a
> > > > > > > potential
> > > > > > > > > > > issue on the follower side too. In theory, it's
> possible
> > > for
> > > > a
> > > > > > > segment
> > > > > > > > > to
> > > > > > > > > > > be tiered immediately after rolling. In that case,
> there
> > > > could
> > > > > be
> > > > > > > very
> > > > > > > > > > > little data after last-tiered-offset. It would be
> useful
> > to
> > > > > think
> > > > > > > > > through
> > > > > > > > > > > how to address this issue.
> > > > > > > > > > >
> > > > > > > > > > > 2. ListOffsetsRequest:
> > > > > > > > > > > 2.1 Typically, we need to bump up the version of the
> > > request
> > > > if
> > > > > > we
> > > > > > > add
> > > > > > > > > a
> > > > > > > > > > > new value for timestamp. See
> > > > > > > > > > > https://github.com/apache/kafka/pull/10760/files#diff-fac7080d67da905a80126d58fc1745c9a1409de7ef7d093c2ac66a888b134633
> > > > > > > > > > > 2.2 Since this changes the inter broker request
> protocol,
> > > it
> > > > > > would
> > > > > > > be
> > > > > > > > > > > useful to have a section on upgrade (e.g. new
> > > > > > > IBP/metadata.version).
> > > > > > > > > > >
> > > > > > > > > > > 3. "Instead of fetching Earliest-Pending-Upload-Offset,
> > > > > > > > > > > it could fetch the last-tiered-offset from the leader, and
> > > > > > > > > > > make a separate leader call to fetch leader epoch for the
> > > > > > > > > > > following offset."
> > > > > > > > > > > Why do we need to make a separate call for the leader
> > > > > > > > > > > epoch? ListOffsetsResponse includes both the offset and
> > > > > > > > > > > the corresponding epoch.
> > > > > > > > > > >
> > > > > > > > > > > 4. "Check if the follower replica is empty and if the
> > > feature
> > > > > to
> > > > > > > use
> > > > > > > > > > > last-tiered-offset is enabled."
> > > > > > > > > > > Why do we need to check if the follower replica is
> empty?
> > > > > > > > > > >
> > > > > > > > > > > 5. It can be confirmed by checking if the leader's
> > > > > > > Log-Start-Offset is
> > > > > > > > > > the
> > > > > > > > > > > same as the Leader's Local-Log-Start-Offset.
> > > > > > > > > > > How does the follower know Local-Log-Start-Offset?
> > > > > > > > > > >
> > > > > > > > > > > Jun
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Mar 30, 2024 at 5:51 AM Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Christo,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for reviewing the KIP.
> > > > > > > > > > > >
> > > > > > > > > > > > The follower needs the earliest-pending-upload-offset
> > > (and
> > > > > the
> > > > > > > > > > > > corresponding leader epoch) from the leader.
> > > > > > > > > > > > This is the first offset the follower will have
> > locally.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > > Abhijeet.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Mar 29, 2024 at 1:14 PM Christo Lolov <christolo...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Heya!
> > > > > > > > > > > > >
> > > > > > > > > > > > > First of all, thank you very much for the proposal,
> > you
> > > > > have
> > > > > > > > > > explained
> > > > > > > > > > > > the
> > > > > > > > > > > > > problem you want solved very well - I think a
> faster
> > > > > > bootstrap
> > > > > > > of
> > > > > > > > > an
> > > > > > > > > > > > empty
> > > > > > > > > > > > > replica is definitely an improvement!
> > > > > > > > > > > > >
> > > > > > > > > > > > > For my understanding, which concrete offset do you
> > want
> > > > the
> > > > > > > leader
> > > > > > > > > to
> > > > > > > > > > > > give
> > > > > > > > > > > > > back to a follower - earliest-pending-upload-offset
> > or
> > > > the
> > > > > > > > > > > > > latest-tiered-offset? If it is the second, then I
> > > believe
> > > > > > > KIP-1005
> > > > > > > > > > > ought
> > > > > > > > > > > > to
> > > > > > > > > > > > > already be exposing that offset as part of the
> > > > ListOffsets
> > > > > > > API, no?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Christo
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, 27 Mar 2024 at 18:23, Abhijeet Kumar <abhijeet.cse....@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I have created KIP-1023 to introduce follower
> fetch
> > > > from
> > > > > > > tiered
> > > > > > > > > > > offset.
> > > > > > > > > > > > > > This feature will be helpful in significantly
> > > reducing
> > > > > > Kafka
> > > > > > > > > > > > > > rebalance/rebuild times when the cluster is
> enabled
> > > > with
> > > > > > > tiered
> > > > > > > > > > > > storage.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1023%3A+Follower+fetch+from+tiered+offset
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Feedback and suggestions are welcome.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > Abhijeet.
