Hi Jun,
Thanks for your thorough review and comments. Please find the inline
replies below.

600. The topic deletion logic needs more details.
600.1 The KIP mentions "The controller considers the topic partition is
deleted only when it determines that there are no log segments for that
topic partition by using RLMM". How is this done?

It uses RLMM#listRemoteLogSegments(), which returns all the remote log
segments for the given topic partition; the partition is considered
deleted once that list is empty.
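
As a rough sketch (the RLMM type and method below follow the KIP's
proposed interfaces, but the exact signatures may differ):

    // Illustrative only: the controller treats the partition's remote data as
    // fully deleted once RLMM reports no remaining remote log segments for it.
    boolean isRemoteDataDeleted(RemoteLogMetadataManager rlmm, TopicPartition tp) {
        Iterator<RemoteLogSegmentMetadata> segments = rlmm.listRemoteLogSegments(tp);
        return !segments.hasNext();
    }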

600.2 "If the delete option is enabled then the leader will stop RLM task
and stop processing and it sets all the remote log segment metadata of that
partition with a delete marker and publishes them to RLMM." We discussed
this earlier. When a topic is being deleted, there may not be a leader for
the deleted partition.

This is a good point. As suggested in the meeting, we will add a
separate section on the topic/partition deletion lifecycle, and this
scenario will be addressed there.

601. Unclean leader election
601.1 Scenario 1: new empty follower
After step 1, the follower restores up to offset 3. So why does it have
LE-2 at offset 5?

Nice catch. The diagram was showing the leader epoch fetched from
remote storage; it should show the epoch cache truncated up to offset
3. Updated the KIP.

601.2 Scenario 5: After Step 3, leader A has inconsistent data between its
local and the tiered data. For example, offset 3 has msg 3 LE-0 locally,
but msg 5 LE-1 in the remote store. While it's ok for the unclean leader to
lose data, it should still return consistent data, whether it's from the
local or the remote store.

There is no inconsistency here, as LE-0 covers offsets [0, 4] and LE-2
covers [5, ]. A fetch will always get the right records for the given
offset and leader epoch. For remote data, RSM is invoked to get the
remote log segment that contains the given offset with the
corresponding leader epoch.
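
To make the read path concrete, a sketch (the RLMM/RSM method names
below follow the proposed interfaces but are illustrative, not final):

    // Illustrative read path for an offset that is only available in remote storage.
    // The (offset, leaderEpoch) pair picks the correct remote segment, so the reader
    // never mixes records from different leader lineages.
    InputStream readRemote(TopicPartition tp, long offset, int leaderEpoch)
            throws RemoteStorageException {
        RemoteLogSegmentMetadata segmentMetadata =
            rlmm.remoteLogSegmentMetadata(tp, offset, leaderEpoch);
        return rsm.fetchLogSegmentData(segmentMetadata, offset);
    }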

601.4 It seems that retention is based on
listRemoteLogSegments(TopicPartition topicPartition, long leaderEpoch).
When there is an unclean leader election, it's possible for the new leader
to not to include certain epochs in its epoch cache. How are remote
segments associated with those epochs being cleaned?

That is a good point. The new leader will also clean up the epochs
earlier than its start leader epoch and delete those segments. It gets
the earliest leader epoch for a partition and starts deleting segments
from that epoch. We need one more API in RLMM to get the earliest
leader epoch.
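
A sketch of that retention pass, assuming the new RLMM API mentioned
above (earliestLeaderEpoch is a placeholder name):

    // Illustrative only: delete remote segments for epochs older than the new
    // leader's start epoch so that epochs missing from its epoch cache (e.g.
    // after an unclean election) still get cleaned up.
    void cleanupEarlierEpochs(TopicPartition tp, int leaderStartEpoch)
            throws RemoteStorageException {
        int epoch = rlmm.earliestLeaderEpoch(tp);   // the proposed new RLMM API
        for (; epoch < leaderStartEpoch; epoch++) {
            Iterator<RemoteLogSegmentMetadata> segments = rlmm.listRemoteLogSegments(tp, epoch);
            while (segments.hasNext()) {
                rsm.deleteLogSegmentData(segments.next());   // then record the deletion in RLMM
            }
        }
    }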

601.5 The KIP discusses the handling of unclean leader elections for user
topics. What about unclean leader elections on
__remote_log_segment_metadata?

This is handled the same way as for other system topics like
__consumer_offsets and __transaction_state. As discussed in the
meeting, we will add the behavior of unclean leader truncation for the
__remote_log_segment_metadata topic.

602. It would be useful to clarify the limitations in the initial release.
The KIP mentions not supporting compacted topics. What about JBOD and
changing the configuration of a topic from delete to compact
after remote.log.storage.enable is enabled?

This was updated in the KIP earlier.

603. RLM leader tasks:
603.1"It checks for rolled over LogSegments (which have the last message
offset less than last stable offset of that topic partition) and copies
them along with their offset/time/transaction indexes and leader epoch
cache to the remote tier." It needs to copy the producer snapshot too.

Right. It copies producer snapshots too as mentioned in LogSegmentData.
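
For reference, LogSegmentData is expected to carry roughly the
following per segment (field names here are indicative, not the final
class):

    // Everything shipped to remote storage for one segment, including the
    // producer snapshot and the leader epoch checkpoint.
    public class LogSegmentData {
        private final File logSegment;
        private final File offsetIndex;
        private final File timeIndex;
        private final File txnIndex;
        private final File producerSnapshot;
        private final File leaderEpochCheckpoint;
        // constructor and accessors omitted
    }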

603.2 "Local logs are not cleaned up till those segments are copied
successfully to remote even though their retention time/size is reached"
This seems weird. If the tiering stops because the remote store is not
available, we don't want the local data to grow forever.

It was clarified in the discussion that the concern was about local
storage growing beyond log.retention. The statement above is about
local.log.retention, not the complete log.retention. Once
log.retention is reached, the local logs are deleted even if they have
not been copied to remote storage.
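
In other words, roughly (the config field names here are taken from
the discussion above and are illustrative; the exact names in the KIP
may differ):

    // Illustrative retention check for a local segment. Once the overall
    // log.retention limit is hit, the local segment is deleted even if it has
    // not yet been copied to remote storage.
    boolean shouldDeleteLocalSegment(long segmentAgeMs, boolean copiedToRemote) {
        if (segmentAgeMs > logRetentionMs)
            return true;                              // overall retention always wins
        return copiedToRemote && segmentAgeMs > localLogRetentionMs;
    }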


604. "RLM maintains a bounded cache(possibly LRU) of the index files of
remote log segments to avoid multiple index fetches from the remote
storage. These indexes can be used in the same way as local segment indexes
are used." Could you provide more details on this? Are the indexes cached
in memory or on disk? If on disk, where are they stored? Are the cached
indexes bound by a certain size?

These are cached on disk and stored in log.dir under a directory named
“__remote_log_index_cache”. They are bounded by a total size, which
will be exposed as a user configuration.
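
A minimal sketch of the kind of bounded LRU intended here (bounded by
entry count for brevity; the actual cache is bounded by total size on
disk and would also remove the evicted index files):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Illustrative only: least-recently-used entries are evicted once the bound
    // is exceeded; on eviction the corresponding index files under
    // log.dir/__remote_log_index_cache would also be removed.
    class RemoteIndexCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        RemoteIndexCache(int maxEntries) {
            super(16, 0.75f, true);   // accessOrder = true gives LRU iteration order
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
        }
    }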

605. BuildingRemoteLogAux
605.1 In this section, two options are listed. Which one is chosen?

Option 2; the KIP has been updated accordingly.

605.2 In option 2, it says  "Build the local leader epoch cache by cutting
the leader epoch sequence received from remote storage to [LSO, ELO]. (LSO
= log start offset)." We need to do the same thing for the producer
snapshot. However, it's hard to cut the producer snapshot to an earlier
offset. Another option is to simply take the lastOffset from the remote
segment and use that as the starting fetch offset in the follower. This
avoids the need for cutting.

Right. The “transactional support” section already mentions that these
details will be added.

606. ListOffsets: Since we need a version bump, could you document it under
a protocol change section?

Sure, we will update the KIP.

607. "LogStartOffset of a topic can point to either of local segment or
remote segment but it is initialised and maintained in the Log class like
now. This is already maintained in `Log` class while loading the logs and
it can also be fetched from RemoteLogMetadataManager." What will happen to
the existing logic (e.g. log recovery) that currently depends on
logStartOffset but assumes it's local?

That logic will use a separate field called localLogStartOffset, which
is the local log start offset, while logStartOffset continues to refer
to the start of the overall log (local or remote).
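
A small sketch of the distinction (field and class names here are
illustrative, not the final code):

    // logStartOffset may point into remote storage; localLogStartOffset always
    // points at the earliest local segment, so local-only logic (e.g. log
    // recovery) keeps using localLogStartOffset unchanged.
    class Log {
        volatile long logStartOffset;        // start of the whole log (local or remote)
        volatile long localLogStartOffset;   // start of the locally retained portion
    }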

608. Handle expired remote segment: How does it pick up new logStartOffset
from deleteRecords?

Good point. This was not addressed in the KIP. Will update the KIP on
how the RLM task handles this scenario.

609. RLMM message format:
609.1 It includes both MaxTimestamp and EventTimestamp. Where does it get
both since the message in the log only contains one timestamp?

`EventTimeStamp` is the timestamp at which that segment metadata event
is generated. This is more for audits.

609.2 If we change just the state (e.g. to DELETE_STARTED), it seems it's
wasteful to have to include all other fields not changed.

This is a good point. We thought about incremental updates. But we
want to make sure all the events are in the expected order and take
action based on the latest event. Will think through the approaches in
detail and update here.

609.3 Could you document which process makes the following transitions
DELETE_MARKED, DELETE_STARTED, DELETE_FINISHED?

Okay, we will document more details.

610. remote.log.reader.max.pending.tasks: "Maximum remote log reader thread
pool task queue size. If the task queue is full, broker will stop reading
remote log segments."  What does the broker do if the queue is full?

The broker returns an error for that topic partition in the fetch
response.
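
Roughly (the pool, task, and helper names below are placeholders, and
the specific error code is not decided in the KIP yet):

    // Illustrative only: the pool has a bounded queue (remote.log.reader.max.pending.tasks);
    // when it is full, the fetch for this topic partition fails with an error
    // instead of blocking the request.
    try {
        remoteLogReaderPool.submit(new RemoteLogReadTask(topicPartition, fetchInfo));
    } catch (RejectedExecutionException e) {
        return errorResponseFor(topicPartition);
    }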

611. What do we return if the request offset/epoch doesn't exist in the
following API?
    RemoteLogSegmentMetadata remoteLogSegmentMetadata(TopicPartition
topicPartition, long offset, int epochForOffset)

Currently it returns null, but we would prefer to change the return
type to Optional and return Optional.empty() if no such segment
exists.
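
That is, the preferred shape would be something like:

    // Returns Optional.empty() when no remote segment exists for the given offset/epoch.
    Optional<RemoteLogSegmentMetadata> remoteLogSegmentMetadata(TopicPartition topicPartition,
                                                                long offset,
                                                                int epochForOffset);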

Thanks,
Satish.

On Tue, Sep 1, 2020 at 9:45 AM Jun Rao <j...@confluent.io> wrote:
>
> Hi, Satish,
>
> Thanks for the updated KIP. Made another pass. A few more comments below.
>
> 600. The topic deletion logic needs more details.
> 600.1 The KIP mentions "The controller considers the topic partition is
> deleted only when it determines that there are no log segments for that
> topic partition by using RLMM". How is this done?
> 600.2 "If the delete option is enabled then the leader will stop RLM task
> and stop processing and it sets all the remote log segment metadata of that
> partition with a delete marker and publishes them to RLMM." We discussed
> this earlier. When a topic is being deleted, there may not be a leader for
> the deleted partition.
>
> 601. Unclean leader election
> 601.1 Scenario 1: new empty follower
> After step 1, the follower restores up to offset 3. So why does it have
> LE-2 at offset 5?
> 601.2 senario 5: After Step 3, leader A has inconsistent data between its
> local and the tiered data. For example. offset 3 has msg 3 LE-0 locally,
> but msg 5 LE-1 in the remote store. While it's ok for the unclean leader to
> lose data, it should still return consistent data, whether it's from the
> local or the remote store.
> 601.3 The follower picks up log start offset using the following api.
> Suppose that we have 3 remote segments (LE, SegmentStartOffset) as (2, 10),
> (3, 20) and (7, 15) due to an unclean leader election. Using the following
> api will cause logStartOffset to go backward from 20 to 15. How do we
> prevent that?
>     earliestLogOffset(TopicPartition topicPartition, int leaderEpoch)
> 601.4 It seems that retention is based on
> listRemoteLogSegments(TopicPartition topicPartition, long leaderEpoch).
> When there is an unclean leader election, it's possible for the new leader
> to not to include certain epochs in its epoch cache. How are remote
> segments associated with those epochs being cleaned?
> 601.5 The KIP discusses the handling of unclean leader elections for user
> topics. What about unclean leader elections on
> __remote_log_segment_metadata?
>
> 602. It would be useful to clarify the limitations in the initial release.
> The KIP mentions not supporting compacted topics. What about JBOD and
> changing the configuration of a topic from delete to compact
> after remote.log.storage.enable is enabled?
>
> 603. RLM leader tasks:
> 603.1"It checks for rolled over LogSegments (which have the last message
> offset less than last stable offset of that topic partition) and copies
> them along with their offset/time/transaction indexes and leader epoch
> cache to the remote tier." It needs to copy the producer snapshot too.
> 603.2 "Local logs are not cleaned up till those segments are copied
> successfully to remote even though their retention time/size is reached"
> This seems weird. If the tiering stops because the remote store is not
> available, we don't want the local data to grow forever.
>
> 604. "RLM maintains a bounded cache(possibly LRU) of the index files of
> remote log segments to avoid multiple index fetches from the remote
> storage. These indexes can be used in the same way as local segment indexes
> are used." Could you provide more details on this? Are the indexes cached
> in memory or on disk? If on disk, where are they stored? Are the cached
> indexes bound by a certain size?
>
> 605. BuildingRemoteLogAux
> 605.1 In this section, two options are listed. Which one is chosen?
> 605.2 In option 2, it says  "Build the local leader epoch cache by cutting
> the leader epoch sequence received from remote storage to [LSO, ELO]. (LSO
> = log start offset)." We need to do the same thing for the producer
> snapshot. However, it's hard to cut the producer snapshot to an earlier
> offset. Another option is to simply take the lastOffset from the remote
> segment and use that as the starting fetch offset in the follower. This
> avoids the need for cutting.
>
> 606. ListOffsets: Since we need a version bump, could you document it under
> a protocol change section?
>
> 607. "LogStartOffset of a topic can point to either of local segment or
> remote segment but it is initialised and maintained in the Log class like
> now. This is already maintained in `Log` class while loading the logs and
> it can also be fetched from RemoteLogMetadataManager." What will happen to
> the existing logic (e.g. log recovery) that currently depends on
> logStartOffset but assumes it's local?
>
> 608. Handle expired remote segment: How does it pick up new logStartOffset
> from deleteRecords?
>
> 609. RLMM message format:
> 609.1 It includes both MaxTimestamp and EventTimestamp. Where does it get
> both since the message in the log only contains one timestamp?
> 609.2 If we change just the state (e.g. to DELETE_STARTED), it seems it's
> wasteful to have to include all other fields not changed.
> 609.3 Could you document which process makes the following transitions
> DELETE_MARKED, DELETE_STARTED, DELETE_FINISHED?
>
> 610. remote.log.reader.max.pending.tasks: "Maximum remote log reader thread
> pool task queue size. If the task queue is full, broker will stop reading
> remote log segments."  What does the broker do if the queue is full?
>
> 611. What do we return if the request offset/epoch doesn't exist in the
> following API?
>     RemoteLogSegmentMetadata remoteLogSegmentMetadata(TopicPartition
> topicPartition, long offset, int epochForOffset)
>
> Jun
>
>
>
> On Mon, Aug 31, 2020 at 11:19 AM Satish Duggana <satish.dugg...@gmail.com>
> wrote:
>
> > KIP is updated with
> > - Remote log segment metadata topic message format/schema.
> > - Added remote log segment metadata state transitions and explained
> > how the deletion of segments is handled, including the case of
> > partition deletions.
> > - Added a few more limitations in the "Non goals" section.
> >
> > Thanks,
> > Satish.
> >
> > On Thu, Aug 27, 2020 at 12:42 AM Harsha Ch <harsha...@gmail.com> wrote:
> > >
> > > Updated the KIP with Meeting Notes section
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-MeetingNotes
> > >
> > > On Tue, Aug 25, 2020 at 1:03 PM Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Hi, Harsha,
> > > >
> > > > Thanks for the summary. Could you add the summary and the recording
> > link to
> > > > the last section of
> > > >
> > > >
> > https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> > > > ?
> > > >
> > > > Jun
> > > >
> > > > On Tue, Aug 25, 2020 at 11:12 AM Harsha Chintalapani <ka...@harsha.io>
> > > > wrote:
> > > >
> > > > > Thanks everyone for attending the meeting today.
> > > > > Here is the recording
> > > > >
> > > > >
> > > >
> > https://drive.google.com/file/d/14PRM7U0OopOOrJR197VlqvRX5SXNtmKj/view?usp=sharing
> > > > >
> > > > > Notes:
> > > > >
> > > > >    1. KIP is updated with follower fetch protocol and ready to
> > reviewed
> > > > >    2. Satish to capture schema of internal metadata topic in the KIP
> > > > >    3. We will update the KIP with details of different cases
> > > > >    4. Test plan will be captured in a doc and will add to the KIP
> > > > >    5. Add a section "Limitations" to capture the capabilities that
> > will
> > > > be
> > > > >    introduced with this KIP and what will not be covered in this KIP.
> > > > >
> > > > > Please add to it I missed anything. Will produce a formal meeting
> > notes
> > > > > from next meeting onwards.
> > > > >
> > > > > Thanks,
> > > > > Harsha
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Aug 24, 2020 at 9:42 PM, Ying Zheng <yi...@uber.com.invalid>
> > > > > wrote:
> > > > >
> > > > > > We did some basic feature tests at Uber. The test cases and
> > results are
> > > > > > shared in this google doc:
> > > > > > https://docs.google.com/spreadsheets/d/1XhNJqjzwXvMCcAOhEH0sSXU6RTvyoSf93DHF-YMfGLk/edit?usp=sharing
> > > > > >
> > > > > > The performance test results were already shared in the KIP last
> > month.
> > > > > >
> > > > > > On Mon, Aug 24, 2020 at 11:10 AM Harsha Ch <harsha...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > "Understand commitments towards driving design & implementation of
> > the
> > > > > KIP
> > > > > > further and how it aligns with participant interests in
> > contributing to
> > > > > the
> > > > > > efforts (ex: in the context of Uber’s Q3/Q4 roadmap)." What is that
> > > > > about?
> > > > > >
> > > > > > On Mon, Aug 24, 2020 at 11:05 AM Kowshik Prakasam <
> > > > > kpraka...@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > Hi Harsha,
> > > > > >
> > > > > > The following google doc contains a proposal for temporary agenda
> > for
> > > > the
> > > > > > KIP-405 <https://issues.apache.org/jira/browse/KIP-405> sync
> > meeting
> > > > > > tomorrow:
> > > > > >
> > > > > > https://docs.google.com/document/d/1pqo8X5LU8TpwfC_iqSuVPezhfCfhGkbGN2TqiPA3LBU/edit
> > > > > > Please could you add it to the Google calendar invite?
> > > > > >
> > > > > > Thank you.
> > > > > >
> > > > > > Cheers,
> > > > > > Kowshik
> > > > > >
> > > > > > On Thu, Aug 20, 2020 at 10:58 AM Harsha Ch <harsha...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > Scheduled a meeting for Tuesday 9am - 10am. I can record and
> > upload for
> > > > > > community to be able to follow the discussion.
> > > > > >
> > > > > > Jun, please add the required folks on confluent side.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Harsha
> > > > > >
> > > > > > On Thu, Aug 20, 2020 at 12:33 AM, Alexandre Dupriez <
> > > > alexandre.dupriez@
> > > > > > gmail.com > wrote:
> > > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > Many thanks for your initiative.
> > > > > >
> > > > > > If you like, I am happy to attend at the time you suggested.
> > > > > >
> > > > > > Many thanks,
> > > > > > Alexandre
> > > > > >
> > > > > > On Wed, Aug 19, 2020 at 22:00, Harsha Ch <harsha...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Jun,
> > > > > > Thanks. This will help a lot. Tuesday will work for us.
> > > > > > -Harsha
> > > > > >
> > > > > > On Wed, Aug 19, 2020 at 1:24 PM Jun Rao < jun@ confluent. io (
> > jun@
> > > > > > confluent.io ) > wrote:
> > > > > >
> > > > > > Hi, Satish, Ying, Harsha,
> > > > > >
> > > > > > Do you think it would be useful to have a regular virtual meeting
> > to
> > > > > > discuss this KIP? The goal of the meeting will be sharing
> > > > > > design/development progress and discussing any open issues to
> > > > > >
> > > > > > accelerate
> > > > > >
> > > > > > this KIP. If so, will every Tuesday (from next week) 9am-10am
> > > > > >
> > > > > > PT
> > > > > >
> > > > > > work for you? I can help set up a Zoom meeting, invite everyone who
> > > > > >
> > > > > > might
> > > > > >
> > > > > > be interested, have it recorded and shared, etc.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, Aug 18, 2020 at 11:01 AM Satish Duggana <
> > > > > >
> > > > > > satish. duggana@ gmail. com ( satish.dugg...@gmail.com ) >
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > Hi Kowshik,
> > > > > >
> > > > > > Thanks for looking into the KIP and sending your comments.
> > > > > >
> > > > > > 5001. Under the section "Follower fetch protocol in detail", the
> > > > > > next-local-offset is the offset upto which the segments are copied
> > > > > >
> > > > > > to
> > > > > >
> > > > > > remote storage. Instead, would last-tiered-offset be a better name
> > > > > >
> > > > > > than
> > > > > >
> > > > > > next-local-offset? last-tiered-offset seems to naturally align well
> > > > > >
> > > > > > with
> > > > > >
> > > > > > the definition provided in the KIP.
> > > > > >
> > > > > > Both next-local-offset and local-log-start-offset were introduced
> > > > > >
> > > > > > to
> > > > > >
> > > > > > talk
> > > > > >
> > > > > > about offsets related to local log. We are fine with
> > > > > >
> > > > > > last-tiered-offset
> > > > > >
> > > > > > too as you suggested.
> > > > > >
> > > > > > 5002. After leadership is established for a partition, the leader
> > > > > >
> > > > > > would
> > > > > >
> > > > > > begin uploading a segment to remote storage. If successful, the
> > > > > >
> > > > > > leader
> > > > > >
> > > > > > would write the updated RemoteLogSegmentMetadata to the metadata
> > > > > >
> > > > > > topic
> > > > > >
> > > > > > (via
> > > > > >
> > > > > > RLMM.putRemoteLogSegmentData). However, for defensive reasons, it
> > > > > >
> > > > > > seems
> > > > > >
> > > > > > useful that before the first time the segment is uploaded by the
> > > > > >
> > > > > > leader
> > > > > >
> > > > > > for
> > > > > >
> > > > > > a partition, the leader should ensure to catch up to all the
> > > > > >
> > > > > > metadata
> > > > > >
> > > > > > events written so far in the metadata topic for that partition (ex:
> > > > > >
> > > > > > by
> > > > > >
> > > > > > previous leader). To achieve this, the leader could start a lease
> > > > > >
> > > > > > (using
> > > > > >
> > > > > > an
> > > > > >
> > > > > > establish_leader metadata event) before commencing tiering, and
> > > > > >
> > > > > > wait
> > > > > >
> > > > > > until
> > > > > >
> > > > > > the event is read back. For example, this seems useful to avoid
> > > > > >
> > > > > > cases
> > > > > >
> > > > > > where
> > > > > >
> > > > > > zombie leaders can be active for the same partition. This can also
> > > > > >
> > > > > > prove
> > > > > >
> > > > > > useful to help avoid making decisions on which segments to be
> > > > > >
> > > > > > uploaded
> > > > > >
> > > > > > for
> > > > > >
> > > > > > a partition, until the current leader has caught up to a complete
> > > > > >
> > > > > > view
> > > > > >
> > > > > > of
> > > > > >
> > > > > > all segments uploaded for the partition so far (otherwise this may
> > > > > >
> > > > > > cause
> > > > > >
> > > > > > same segment being uploaded twice -- once by the previous leader
> > > > > >
> > > > > > and
> > > > > >
> > > > > > then
> > > > > >
> > > > > > by the new leader).
> > > > > >
> > > > > > We allow copying segments to remote storage which may have common
> > > > > >
> > > > > > offsets.
> > > > > >
> > > > > > Please go through the KIP to understand the follower fetch
> > > > > >
> > > > > > protocol(1) and
> > > > > >
> > > > > > follower to leader transition(2).
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-FollowerReplication
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition
> > > > > >
> > > > > > 5003. There is a natural interleaving between uploading a segment
> > > > > >
> > > > > > to
> > > > > >
> > > > > > remote
> > > > > >
> > > > > > store, and, writing a metadata event for the same (via
> > > > > > RLMM.putRemoteLogSegmentData). There can be cases where a remote
> > > > > >
> > > > > > segment
> > > > > >
> > > > > > is
> > > > > >
> > > > > > uploaded, then the leader fails and a corresponding metadata event
> > > > > >
> > > > > > never
> > > > > >
> > > > > > gets written. In such cases, the orphaned remote segment has to be
> > > > > > eventually deleted (since there is no confirmation of the upload).
> > > > > >
> > > > > > To
> > > > > >
> > > > > > handle this, we could use 2 separate metadata events viz.
> > > > > >
> > > > > > copy_initiated
> > > > > >
> > > > > > and copy_completed, so that copy_initiated events that don't have a
> > > > > > corresponding copy_completed event can be treated as garbage and
> > > > > >
> > > > > > deleted
> > > > > >
> > > > > > from the remote object store by the broker.
> > > > > >
> > > > > > We are already updating RMM with RemoteLogSegmentMetadata pre and
> > > > > >
> > > > > > post
> > > > > >
> > > > > > copying of log segments. We had a flag in RemoteLogSegmentMetadata
> > > > > >
> > > > > > whether
> > > > > >
> > > > > > it is copied or not. But we are making changes in
> > > > > >
> > > > > > RemoteLogSegmentMetadata
> > > > > >
> > > > > > to introduce a state field in RemoteLogSegmentMetadata which will
> > > > > >
> > > > > > have the
> > > > > >
> > > > > > respective started and finished states. This includes for other
> > > > > >
> > > > > > operations
> > > > > >
> > > > > > like delete too.
> > > > > >
> > > > > > 5004. In the default implementation of RLMM (using the internal
> > > > > >
> > > > > > topic
> > > > > >
> > > > > > __remote_log_metadata), a separate topic called
> > > > > > __remote_segments_to_be_deleted is going to be used just to track
> > > > > >
> > > > > > failures
> > > > > >
> > > > > > in removing remote log segments. A separate topic (effectively
> > > > > >
> > > > > > another
> > > > > >
> > > > > > metadata stream) introduces some maintenance overhead and design
> > > > > > complexity. It seems to me that the same can be achieved just by
> > > > > >
> > > > > > using
> > > > > >
> > > > > > just
> > > > > >
> > > > > > the __remote_log_metadata topic with the following steps: 1) the
> > > > > >
> > > > > > leader
> > > > > >
> > > > > > writes a delete_initiated metadata event, 2) the leader deletes the
> > > > > >
> > > > > > segment
> > > > > >
> > > > > > and 3) the leader writes a delete_completed metadata event. Tiered
> > > > > >
> > > > > > segments
> > > > > >
> > > > > > that have delete_initiated message and not delete_completed
> > > > > >
> > > > > > message,
> > > > > >
> > > > > > can
> > > > > >
> > > > > > be
> > > > > >
> > > > > > considered to be a failure and retried.
> > > > > >
> > > > > > Jun suggested in earlier mail to keep this simple . We decided not
> > > > > >
> > > > > > to have
> > > > > >
> > > > > > this topic as mentioned in our earlier replies, updated the KIP.
> > > > > >
> > > > > > As I
> > > > > >
> > > > > > mentioned in an earlier comment, we are adding state entries for
> > > > > >
> > > > > > delete
> > > > > >
> > > > > > operations too.
> > > > > >
> > > > > > 5005. When a Kafka cluster is provisioned for the first time with
> > > > > >
> > > > > > KIP-405 <https://issues.apache.org/jira/browse/KIP-405>
> > > > > >
> > > > > > tiered storage enabled, could you explain in the KIP about how the
> > > > > > bootstrap for __remote_log_metadata topic will be performed in the
> > > > > >
> > > > > > the
> > > > > >
> > > > > > default RLMM implementation?
> > > > > >
> > > > > > __remote_log_segment_metadata topic is created by default with the
> > > > > > respective topic like partitions/replication-factor etc. Can you be
> > > > > >
> > > > > > more
> > > > > >
> > > > > > specific on what you are looking for?
> > > > > >
> > > > > > 5008. The system-wide configuration 'remote.log.storage.enable' is used
> > > > > >
> > > > > > to
> > > > > >
> > > > > > enable tiered storage. Can this be made a topic-level
> > > > > >
> > > > > > configuration,
> > > > > >
> > > > > > so
> > > > > >
> > > > > > that the user can enable/disable tiered storage at a topic level
> > > > > >
> > > > > > rather
> > > > > >
> > > > > > than a system-wide default for an entire Kafka cluster?
> > > > > >
> > > > > > Yes, we mentioned in an earlier mail thread that it will be
> > > > > >
> > > > > > supported at
> > > > > >
> > > > > > topic level too, updated the KIP.
> > > > > >
> > > > > > 5009. Whenever a topic with tiered storage enabled is deleted, the
> > > > > > underlying actions require the topic data to be deleted in local
> > > > > >
> > > > > > store
> > > > > >
> > > > > > as
> > > > > >
> > > > > > well as remote store, and eventually the topic metadata needs to be
> > > > > >
> > > > > > deleted
> > > > > >
> > > > > > too. What is the role of the controller in deleting a topic and
> > > > > >
> > > > > > it's
> > > > > >
> > > > > > contents, while the topic has tiered storage enabled?
> > > > > >
> > > > > > When a topic partition is deleted, there will be an event for that
> > > > > >
> > > > > > in RLMM
> > > > > >
> > > > > > for its deletion and the controller considers that topic is deleted
> > > > > >
> > > > > > only
> > > > > >
> > > > > > when all the remote log segments are also deleted.
> > > > > >
> > > > > > 5010. RLMM APIs are currently synchronous, for example
> > > > > > RLMM.putRemoteLogSegmentData waits until the put operation is
> > > > > >
> > > > > > completed
> > > > > >
> > > > > > in
> > > > > >
> > > > > > the remote metadata store. It may also block until the leader has
> > > > > >
> > > > > > caught
> > > > > >
> > > > > > up
> > > > > >
> > > > > > to the metadata (not sure). Could we make these apis asynchronous
> > > > > >
> > > > > > (ex:
> > > > > >
> > > > > > based on java.util.concurrent.Future) to provide room for tapping
> > > > > > performance improvements such as non-blocking i/o? 5011. The same
> > > > > >
> > > > > > question
> > > > > >
> > > > > > as 5009 on sync vs async api for RSM. Have we considered the
> > > > > >
> > > > > > pros/cons of
> > > > > >
> > > > > > making the RSM apis asynchronous?
> > > > > >
> > > > > > Async methods are used to do other tasks while the result is not
> > > > > > available. In this case, we need to have the result before
> > > > > >
> > > > > > proceeding to
> > > > > >
> > > > > > take next actions. These APIs are evolving and these can be updated
> > > > > >
> > > > > > as and
> > > > > >
> > > > > > when needed instead of having them as asynchronous now.
> > > > > >
> > > > > > Thanks,
> > > > > > Satish.
> > > > > >
> > > > > > On Fri, Aug 14, 2020 at 4:30 AM Kowshik Prakasam <
> > > > > >
> > > > > > kprakasam@ confluent. io ( kpraka...@confluent.io )
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > Hi Harsha/Satish,
> > > > > >
> > > > > > Thanks for the great KIP. Below are the first set of
> > > > > >
> > > > > > questions/suggestions
> > > > > >
> > > > > > I had after making a pass on the KIP.
> > > > > >
> > > > > > 5001. Under the section "Follower fetch protocol in detail", the
> > > > > > next-local-offset is the offset upto which the segments are copied
> > > > > >
> > > > > > to
> > > > > >
> > > > > > remote storage. Instead, would last-tiered-offset be a better name
> > > > > >
> > > > > > than
> > > > > >
> > > > > > next-local-offset? last-tiered-offset seems to naturally align
> > > > > >
> > > > > > well
> > > > > >
> > > > > > with
> > > > > >
> > > > > > the definition provided in the KIP.
> > > > > >
> > > > > > 5002. After leadership is established for a partition, the leader
> > > > > >
> > > > > > would
> > > > > >
> > > > > > begin uploading a segment to remote storage. If successful, the
> > > > > >
> > > > > > leader
> > > > > >
> > > > > > would write the updated RemoteLogSegmentMetadata to the metadata
> > > > > >
> > > > > > topic
> > > > > >
> > > > > > (via
> > > > > >
> > > > > > RLMM.putRemoteLogSegmentData). However, for defensive reasons, it
> > > > > >
> > > > > > seems
> > > > > >
> > > > > > useful that before the first time the segment is uploaded by the
> > > > > >
> > > > > > leader
> > > > > >
> > > > > > for
> > > > > >
> > > > > > a partition, the leader should ensure to catch up to all the
> > > > > >
> > > > > > metadata
> > > > > >
> > > > > > events written so far in the metadata topic for that partition
> > > > > >
> > > > > > (ex:
> > > > > >
> > > > > > by
> > > > > >
> > > > > > previous leader). To achieve this, the leader could start a lease
> > > > > >
> > > > > > (using
> > > > > >
> > > > > > an
> > > > > >
> > > > > > establish_leader metadata event) before commencing tiering, and
> > > > > >
> > > > > > wait
> > > > > >
> > > > > > until
> > > > > >
> > > > > > the event is read back. For example, this seems useful to avoid
> > > > > >
> > > > > > cases
> > > > > >
> > > > > > where
> > > > > >
> > > > > > zombie leaders can be active for the same partition. This can also
> > > > > >
> > > > > > prove
> > > > > >
> > > > > > useful to help avoid making decisions on which segments to be
> > > > > >
> > > > > > uploaded
> > > > > >
> > > > > > for
> > > > > >
> > > > > > a partition, until the current leader has caught up to a complete
> > > > > >
> > > > > > view
> > > > > >
> > > > > > of
> > > > > >
> > > > > > all segments uploaded for the partition so far (otherwise this may
> > > > > >
> > > > > > cause
> > > > > >
> > > > > > same segment being uploaded twice -- once by the previous leader
> > > > > >
> > > > > > and
> > > > > >
> > > > > > then
> > > > > >
> > > > > > by the new leader).
> > > > > >
> > > > > > 5003. There is a natural interleaving between uploading a segment
> > > > > >
> > > > > > to
> > > > > >
> > > > > > remote
> > > > > >
> > > > > > store, and, writing a metadata event for the same (via
> > > > > > RLMM.putRemoteLogSegmentData). There can be cases where a remote
> > > > > >
> > > > > > segment
> > > > > >
> > > > > > is
> > > > > >
> > > > > > uploaded, then the leader fails and a corresponding metadata event
> > > > > >
> > > > > > never
> > > > > >
> > > > > > gets written. In such cases, the orphaned remote segment has to be
> > > > > > eventually deleted (since there is no confirmation of the upload).
> > > > > >
> > > > > > To
> > > > > >
> > > > > > handle this, we could use 2 separate metadata events viz.
> > > > > >
> > > > > > copy_initiated
> > > > > >
> > > > > > and copy_completed, so that copy_initiated events that don't have
> > > > > >
> > > > > > a
> > > > > >
> > > > > > corresponding copy_completed event can be treated as garbage and
> > > > > >
> > > > > > deleted
> > > > > >
> > > > > > from the remote object store by the broker.
> > > > > >
> > > > > > 5004. In the default implementation of RLMM (using the internal
> > > > > >
> > > > > > topic
> > > > > >
> > > > > > __remote_log_metadata), a separate topic called
> > > > > > __remote_segments_to_be_deleted is going to be used just to track
> > > > > >
> > > > > > failures
> > > > > >
> > > > > > in removing remote log segments. A separate topic (effectively
> > > > > >
> > > > > > another
> > > > > >
> > > > > > metadata stream) introduces some maintenance overhead and design
> > > > > > complexity. It seems to me that the same can be achieved just by
> > > > > >
> > > > > > using
> > > > > >
> > > > > > just
> > > > > >
> > > > > > the __remote_log_metadata topic with the following steps: 1) the
> > > > > >
> > > > > > leader
> > > > > >
> > > > > > writes a delete_initiated metadata event, 2) the leader deletes
> > > > > >
> > > > > > the
> > > > > >
> > > > > > segment
> > > > > >
> > > > > > and 3) the leader writes a delete_completed metadata event. Tiered
> > > > > >
> > > > > > segments
> > > > > >
> > > > > > that have delete_initiated message and not delete_completed
> > > > > >
> > > > > > message,
> > > > > >
> > > > > > can
> > > > > >
> > > > > > be
> > > > > >
> > > > > > considered to be a failure and retried.
> > > > > >
> > > > > > 5005. When a Kafka cluster is provisioned for the first time with
> > > > > >
> > > > > > KIP-405 <https://issues.apache.org/jira/browse/KIP-405>
> > > > > >
> > > > > > tiered storage enabled, could you explain in the KIP about how the
> > > > > > bootstrap for __remote_log_metadata topic will be performed in the
> > > > > >
> > > > > > the
> > > > > >
> > > > > > default RLMM implementation?
> > > > > >
> > > > > > 5006. I currently do not see details on the KIP on why RocksDB was
> > > > > >
> > > > > > chosen
> > > > > >
> > > > > > as the default cache implementation, and how it is going to be
> > > > > >
> > > > > > used.
> > > > > >
> > > > > > Were
> > > > > >
> > > > > > alternatives compared/considered? For example, it would be useful
> > > > > >
> > > > > > to
> > > > > >
> > > > > > explain/evaulate the following: 1) debuggability of the RocksDB
> > > > > >
> > > > > > JNI
> > > > > >
> > > > > > interface, 2) performance, 3) portability across platforms and 4)
> > > > > >
> > > > > > interface
> > > > > >
> > > > > > parity of RocksDB’s JNI api with it's underlying C/C++ api.
> > > > > >
> > > > > > 5007. For the RocksDB cache (the default implementation of RLMM),
> > > > > >
> > > > > > what
> > > > > >
> > > > > > is
> > > > > >
> > > > > > the relationship/mapping between the following: 1) # of tiered
> > > > > >
> > > > > > partitions,
> > > > > >
> > > > > > 2) # of partitions of metadata topic __remote_log_metadata and 3)
> > > > > >
> > > > > > #
> > > > > >
> > > > > > of
> > > > > >
> > > > > > RocksDB instances? i.e. is the plan to have a RocksDB instance per
> > > > > >
> > > > > > tiered
> > > > > >
> > > > > > partition, or per metadata topic partition, or just 1 for per
> > > > > >
> > > > > > broker?
> > > > > >
> > > > > > 5008. The system-wide configuration 'remote.log.storage.enable' is
> > > > > >
> > > > > > used
> > > > > >
> > > > > > to
> > > > > >
> > > > > > enable tiered storage. Can this be made a topic-level
> > > > > >
> > > > > > configuration,
> > > > > >
> > > > > > so
> > > > > >
> > > > > > that the user can enable/disable tiered storage at a topic level
> > > > > >
> > > > > > rather
> > > > > >
> > > > > > than a system-wide default for an entire Kafka cluster?
> > > > > >
> > > > > > 5009. Whenever a topic with tiered storage enabled is deleted, the
> > > > > > underlying actions require the topic data to be deleted in local
> > > > > >
> > > > > > store
> > > > > >
> > > > > > as
> > > > > >
> > > > > > well as remote store, and eventually the topic metadata needs to
> > > > > >
> > > > > > be
> > > > > >
> > > > > > deleted
> > > > > >
> > > > > > too. What is the role of the controller in deleting a topic and
> > > > > >
> > > > > > it's
> > > > > >
> > > > > > contents, while the topic has tiered storage enabled?
> > > > > >
> > > > > > 5010. RLMM APIs are currently synchronous, for example
> > > > > > RLMM.putRemoteLogSegmentData waits until the put operation is
> > > > > >
> > > > > > completed
> > > > > >
> > > > > > in
> > > > > >
> > > > > > the remote metadata store. It may also block until the leader has
> > > > > >
> > > > > > caught
> > > > > >
> > > > > > up
> > > > > >
> > > > > > to the metadata (not sure). Could we make these apis asynchronous
> > > > > >
> > > > > > (ex:
> > > > > >
> > > > > > based on java.util.concurrent.Future) to provide room for tapping
> > > > > > performance improvements such as non-blocking i/o?
> > > > > >
> > > > > > 5011. The same question as 5009 on sync vs async api for RSM. Have
> > > > > >
> > > > > > we
> > > > > >
> > > > > > considered the pros/cons of making the RSM apis asynchronous?
> > > > > >
> > > > > > Cheers,
> > > > > > Kowshik
> > > > > >
> > > > > > On Thu, Aug 6, 2020 at 11:02 AM Satish Duggana <
> > > > > >
> > > > > > satish. duggana@ gmail. com ( satish.dugg...@gmail.com )
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > Hi Jun,
> > > > > > Thanks for your comments.
> > > > > >
> > > > > > At the high level, that approach sounds reasonable to
> > > > > >
> > > > > > me. It would be useful to document how RLMM handles overlapping
> > > > > >
> > > > > > archived
> > > > > >
> > > > > > offset ranges and how those overlapping segments are deleted
> > > > > >
> > > > > > through
> > > > > >
> > > > > > retention.
> > > > > >
> > > > > > Sure, we will document that in the KIP.
> > > > > >
> > > > > > How is the remaining part of the KIP coming along? To me, the
> > > > > >
> > > > > > two
> > > > > >
> > > > > > biggest
> > > > > >
> > > > > > missing items are (1) more detailed documentation on how all the
> > > > > >
> > > > > > new
> > > > > >
> > > > > > APIs
> > > > > >
> > > > > > are being used and (2) metadata format and usage in the internal
> > > > > >
> > > > > > topic
> > > > > >
> > > > > > __remote_log_metadata.
> > > > > >
> > > > > > We are working on updating APIs based on the recent discussions
> > > > > >
> > > > > > and get
> > > > > >
> > > > > > the perf numbers by plugging in rocksdb as a cache store for
> > > > > >
> > > > > > RLMM.
> > > > > >
> > > > > > We will update the KIP with the updated APIs and with the above
> > > > > >
> > > > > > requested
> > > > > >
> > > > > > details in a few days and let you know.
> > > > > >
> > > > > > Thanks,
> > > > > > Satish.
> > > > > >
> > > > > > On Wed, Aug 5, 2020 at 12:49 AM Jun Rao < jun@ confluent. io (
> > jun@
> > > > > > confluent.io ) > wrote:
> > > > > >
> > > > > > Hi, Ying, Satish,
> > > > > >
> > > > > > Thanks for the reply. At the high level, that approach sounds
> > > > > >
> > > > > > reasonable
> > > > > >
> > > > > > to
> > > > > >
> > > > > > me. It would be useful to document how RLMM handles overlapping
> > > > > >
> > > > > > archived
> > > > > >
> > > > > > offset ranges and how those overlapping segments are deleted
> > > > > >
> > > > > > through
> > > > > >
> > > > > > retention.
> > > > > >
> > > > > > How is the remaining part of the KIP coming along? To me, the
> > > > > >
> > > > > > two
> > > > > >
> > > > > > biggest
> > > > > >
> > > > > > missing items are (1) more detailed documentation on how all the
> > > > > >
> > > > > > new
> > > > > >
> > > > > > APIs
> > > > > >
> > > > > > are being used and (2) metadata format and usage in the internal
> > > > > >
> > > > > > topic
> > > > > >
> > > > > > __remote_log_metadata.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, Aug 4, 2020 at 8:32 AM Satish Duggana <
> > > > > >
> > > > > > satish. duggana@ gmail. com ( satish.dugg...@gmail.com ) >
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > Hi Jun,
> > > > > > Thanks for your comment,
> > > > > >
> > > > > > 1001. Using the new leader as the source of truth may be fine
> > > > > >
> > > > > > too.
> > > > > >
> > > > > > What's
> > > > > >
> > > > > > not clear to me is when a follower takes over as the new
> > > > > >
> > > > > > leader,
> > > > > >
> > > > > > from
> > > > > >
> > > > > > which
> > > > > >
> > > > > > offset does it start archiving to the block storage. I assume
> > > > > >
> > > > > > that
> > > > > >
> > > > > > the
> > > > > >
> > > > > > new
> > > > > >
> > > > > > leader starts from the latest archived ooffset by the previous
> > > > > >
> > > > > > leader,
> > > > > >
> > > > > > but
> > > > > >
> > > > > > it seems that's not the case. It would be useful to document
> > > > > >
> > > > > > this
> > > > > >
> > > > > > in
> > > > > >
> > > > > > the
> > > > > >
> > > > > > Wiki.
> > > > > >
> > > > > > When a follower becomes a leader it needs to findout the offset
> > > > > >
> > > > > > from
> > > > > >
> > > > > > which the segments to be copied to remote storage. This is
> > > > > >
> > > > > > found
> > > > > >
> > > > > > by
> > > > > >
> > > > > > traversing from the the latest leader epoch from leader epoch
> > > > > >
> > > > > > history
> > > > > >
> > > > > > and find the highest offset of a segment with that epoch copied
> > > > > >
> > > > > > into
> > > > > >
> > > > > > remote storage by using respective RLMM APIs. If it can not
> > > > > >
> > > > > > find
> > > > > >
> > > > > > an
> > > > > >
> > > > > > entry then it checks for the previous leader epoch till it
> > > > > >
> > > > > > finds
> > > > > >
> > > > > > an
> > > > > >
> > > > > > entry, If there are no entries till the earliest leader epoch
> > > > > >
> > > > > > in
> > > > > >
> > > > > > leader epoch cache then it starts copying the segments from the
> > > > > >
> > > > > > earliest
> > > > > >
> > > > > > epoch entry’s offset.
> > > > > > Added an example in the KIP here[1]. We will update RLMM APIs
> > > > > >
> > > > > > in
> > > > > >
> > > > > > the
> > > > > >
> > > > > > KIP.
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition
> > > > > >
> > > > > > Satish.
> > > > > >
> > > > > > On Tue, Aug 4, 2020 at 9:00 PM Satish Duggana <
> > > > > >
> > > > > > satish. duggana@ gmail. com ( satish.dugg...@gmail.com ) >
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > Hi Ying,
> > > > > > Thanks for your comment.
> > > > > >
> > > > > > 1001. Using the new leader as the source of truth may be fine
> > > > > >
> > > > > > too.
> > > > > >
> > > > > > What's
> > > > > >
> > > > > > not clear to me is when a follower takes over as the new
> > > > > >
> > > > > > leader,
> > > > > >
> > > > > > from
> > > > > >
> > > > > > which
> > > > > >
> > > > > > offset does it start archiving to the block storage. I assume
> > > > > >
> > > > > > that
> > > > > >
> > > > > > the
> > > > > >
> > > > > > new
> > > > > >
> > > > > > leader starts from the latest archived ooffset by the
> > > > > >
> > > > > > previous
> > > > > >
> > > > > > leader,
> > > > > >
> > > > > > but
> > > > > >
> > > > > > it seems that's not the case. It would be useful to document
> > > > > >
> > > > > > this in
> > > > > >
> > > > > > the
> > > > > >
> > > > > > Wiki.
> > > > > >
> > > > > > When a follower becomes a leader it needs to findout the
> > > > > >
> > > > > > offset
> > > > > >
> > > > > > from
> > > > > >
> > > > > > which the segments to be copied to remote storage. This is
> > > > > >
> > > > > > found
> > > > > >
> > > > > > by
> > > > > >
> > > > > > traversing from the the latest leader epoch from leader epoch
> > > > > >
> > > > > > history
> > > > > >
> > > > > > and find the highest offset of a segment with that epoch
> > > > > >
> > > > > > copied
> > > > > >
> > > > > > into
> > > > > >
> > > > > > remote storage by using respective RLMM APIs. If it can not
> > > > > >
> > > > > > find
> > > > > >
> > > > > > an
> > > > > >
> > > > > > entry then it checks for the previous leader epoch till it
> > > > > >
> > > > > > finds
> > > > > >
> > > > > > an
> > > > > >
> > > > > > entry, If there are no entries till the earliest leader epoch
> > > > > >
> > > > > > in
> > > > > >
> > > > > > leader epoch cache then it starts copying the segments from
> > > > > >
> > > > > > the
> > > > > >
> > > > > > earliest epoch entry’s offset.
> > > > > > Added an example in the KIP here[1]. We will update RLMM APIs
> > > > > >
> > > > > > in
> > > > > >
> > > > > > the
> > > > > >
> > > > > > KIP.
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage#KIP405:KafkaTieredStorage-Followertoleadertransition
> > > > > >
> > > > > > Satish.
> > > > > >
> > > > > > On Tue, Aug 4, 2020 at 10:28 AM Ying Zheng
> > > > > >
> > > > > > < yingz@ uber. com. invalid ( yi...@uber.com.invalid ) >
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > Thank you for the comment! The current KIP is not very
> > > > > >
> > > > > > clear
> > > > > >
> > > > > > about
> > > > > >
> > > > > > this
> > > > > >
> > > > > > part.
> > > > > >
> > > > > > 1001. The new leader will start archiving from the earliest
> > > > > >
> > > > > > local
> > > > > >
> > > > > > segment
> > > > > >
> > > > > > that is not fully
> > > > > > covered by the "valid" remote data. "valid" means the
> > > > > >
> > > > > > (offset,
> > > > > >
> > > > > > leader
> > > > > >
> > > > > > epoch) pair is valid
> > > > > > based on the leader-epoch history.
> > > > > >
> > > > > > There are some edge cases where the same offset range (with
> > > > > >
> > > > > > the
> > > > > >
> > > > > > same
> > > > > >
> > > > > > leader
> > > > > >
> > > > > > epoch) can
> > > > > > be copied to the remote storage more than once. But this
> > > > > >
> > > > > > kind
> > > > > >
> > > > > > of
> > > > > >
> > > > > > duplication shouldn't be a
> > > > > > problem.
> > > > > >
> > > > > > Staish is going to explain the details in the KIP with
> > > > > >
> > > > > > examples.
> > > > > >
> > > > > > On Fri, Jul 31, 2020 at 2:55 PM Jun Rao < jun@ confluent.
> > > > > >
> > > > > > io (
> > > > > >
> > > > > > j...@confluent.io ) >
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > Hi, Ying,
> > > > > >
> > > > > > Thanks for the reply.
> > > > > >
> > > > > > 1001. Using the new leader as the source of truth may be
> > > > > >
> > > > > > fine
> > > > > >
> > > > > > too.
> > > > > >
> > > > > > What's
> > > > > >
> > > > > > not clear to me is when a follower takes over as the new
> > > > > >
> > > > > > leader,
> > > > > >
> > > > > > from which
> > > > > >
> > > > > > offset does it start archiving to the block storage. I
> > > > > >
> > > > > > assume
> > > > > >
> > > > > > that
> > > > > >
> > > > > > the new
> > > > > >
> > > > > > leader starts from the latest archived ooffset by the
> > > > > >
> > > > > > previous
> > > > > >
> > > > > > leader, but
> > > > > >
> > > > > > it seems that's not the case. It would be useful to
> > > > > >
> > > > > > document
> > > > > >
> > > > > > this in
> > > > > >
> > > > > > the
> > > > > >
> > > > > > wiki.
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Tue, Jul 28, 2020 at 12:11 PM Ying Zheng
> > > > > >
> > > > > > < yingz@ uber. com. invalid ( yi...@uber.com.invalid ) >
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > 1001.
> > > > > >
> > > > > > We did consider this approach. The concerns are
> > > > > > 1) This makes unclean-leader-election rely on remote
> > > > > >
> > > > > > storage.
> > > > > >
> > > > > > In
> > > > > >
> > > > > > case
> > > > > >
> > > > > > the
> > > > > >
> > > > > > remote storage
> > > > > > is unavailable, Kafka will not be able to finish the
> > > > > >
> > > > > >
> > > > >
> > > >
> >
> >
