Hi Jun,
Please go through the wiki, which has the latest updates. The Google doc is
updated frequently to stay in sync with the wiki.

Thanks,
Satish.

On Tue, May 19, 2020 at 12:30 AM Jun Rao <j...@confluent.io> wrote:

> Hi, Satish,
>
> Thanks for the update. Just to clarify. Which doc has the latest updates,
> the wiki or the google doc?
>
> Jun
>
> On Thu, May 14, 2020 at 10:38 AM Satish Duggana <satish.dugg...@gmail.com>
> wrote:
>
>> Hi Jun,
>> Thanks for your comments.  We updated the KIP with more details.
>>
>> >100. For each of the operations related to tiering, it would be useful
>> to provide a description on how it works with the new API. These include
>> things like consumer fetch, replica fetch, offsetForTimestamp, retention
>> (remote and local) by size, time and logStartOffset, topic deletion, etc.
>> This will tell us if the proposed APIs are sufficient.
>>
>> We addressed most of these APIs in the KIP. We can add more details if
>> needed.
>>
>> >101. For the default implementation based on internal topic, is it meant
>> as a proof of concept or for production usage? I assume that it's the
>> former. However, if it's the latter, then the KIP needs to describe the
>> design in more detail.
>>
>> It is meant for production usage, as mentioned in an earlier mail. We plan
>> to update this section in the next few days.
>>
>> >102. When tiering a segment, the segment is first written to the object
>> store and then its metadata is written to RLMM using the api "void 
>> putRemoteLogSegmentData()".
>> One potential issue with this approach is that if the system fails after
>> the first operation, it leaves garbage in the object store that's never
>> reclaimed. One way to improve this is to have two separate APIs, something
>> like preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>>
>> That is a good point. We currently use a different approach with markers
>> in the segment, but your suggestion is much better.
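For concreteness, here is a rough sketch of the prepare/commit split suggested above. The method names come from Jun's suggestion and the state machine is hypothetical, not the KIP's actual API; the point is only that a crash between upload and commit leaves a PREPARED record that a cleaner can use to reclaim the orphaned upload.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical two-phase metadata write: intent is recorded before the
// copy to the object store, and the entry is committed only after the
// copy succeeds. Segments stuck in PREPARED are garbage-collection
// candidates instead of silently leaked objects.
class TwoPhaseSegmentPut {
    enum State { PREPARED, COMMITTED }

    private final Map<String, State> metadata = new HashMap<>();

    // Phase 1: record intent before uploading segment data.
    void preparePutRemoteLogSegmentData(String segmentId) {
        metadata.put(segmentId, State.PREPARED);
    }

    // Phase 2: mark the segment visible after the upload is acked.
    void commitPutRemoteLogSegmentData(String segmentId) {
        metadata.put(segmentId, State.COMMITTED);
    }

    // Segments that never reached COMMITTED can be reclaimed by a cleaner.
    Set<String> uncommittedSegments() {
        Set<String> orphans = new HashSet<>();
        for (Map.Entry<String, State> e : metadata.entrySet()) {
            if (e.getValue() == State.PREPARED) orphans.add(e.getKey());
        }
        return orphans;
    }

    boolean isVisible(String segmentId) {
        return metadata.get(segmentId) == State.COMMITTED;
    }
}
```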
>>
>> >103. It seems that the transactional support and the ability to read
>> from follower are missing.
>>
>> KIP is updated with transactional support, follower fetch semantics, and
>> reading from a follower.
>>
>> >104. It would be useful to provide a testing plan for this KIP.
>>
>> We added a few tests by introducing test util for tiered storage in the
>> PR. We will provide the testing plan in the next few days.
>>
>> Thanks,
>> Satish.
>>
>>
>> On Wed, Feb 26, 2020 at 9:43 PM Harsha Chintalapani <ka...@harsha.io>
>> wrote:
>>
>>>
>>> On Tue, Feb 25, 2020 at 12:46 PM, Jun Rao <j...@confluent.io> wrote:
>>>
>>>> Hi, Satish,
>>>>
>>>> Thanks for the updated doc. The new API seems to be an improvement
>>>> overall. A few more comments below.
>>>>
>>>> 100. For each of the operations related to tiering, it would be useful
>>>> to provide a description on how it works with the new API. These include
>>>> things like consumer fetch, replica fetch, offsetForTimestamp, retention
>>>> (remote and local) by size, time and logStartOffset, topic deletion,
>>>> etc. This will tell us if the proposed APIs are sufficient.
>>>>
>>>
>>> Thanks for the feedback, Jun. We will add more details around this.
>>>
>>>
>>>> 101. For the default implementation based on internal topic, is it
>>>> meant as a proof of concept or for production usage? I assume that it's the
>>>> former. However, if it's the latter, then the KIP needs to describe the
>>>> design in more detail.
>>>>
>>>
>>> Yes, it is meant for production use. Ideally it would be good to merge
>>> this in as the default implementation of the metadata service. We can add
>>> more details around design and testing.
>>>
>>>> 102. When tiering a segment, the segment is first written to the object
>>>> store and then its metadata is written to RLMM using the api "void
>>>> putRemoteLogSegmentData()".
>>>> One potential issue with this approach is that if the system fails
>>>> after the first operation, it leaves garbage in the object store that's
>>>> never reclaimed. One way to improve this is to have two separate APIs,
>>>> something like preparePutRemoteLogSegmentData() and
>>>> commitPutRemoteLogSegmentData().
>>>>
>>>> 103. It seems that the transactional support and the ability to read
>>>> from follower are missing.
>>>>
>>>> 104. It would be useful to provide a testing plan for this KIP.
>>>>
>>>
>>> We are working on adding more details around transactional support and
>>> coming up with a test plan, which will include system tests and
>>> integration tests.
>>>
>>>> Thanks,
>>>>
>>>> Jun
>>>>
>>>> On Mon, Feb 24, 2020 at 8:10 AM Satish Duggana <
>>>> satish.dugg...@gmail.com> wrote:
>>>>
>>>> Hi Jun,
>>>> Please look at the earlier reply and let us know your comments.
>>>>
>>>> Thanks,
>>>> Satish.
>>>>
>>>> On Wed, Feb 12, 2020 at 4:06 PM Satish Duggana <
>>>> satish.dugg...@gmail.com> wrote:
>>>>
>>>> Hi Jun,
>>>> Thanks for your comments on the separation of remote log metadata
>>>> storage and remote log storage.
>>>> We had a few discussions since early Jan on how to support eventually
>>>> consistent stores like S3 by decoupling remote log segment metadata from
>>>> remote log storage. It is written up in detail in the doc here [1]. Below
>>>> is a brief summary of the discussion from that doc.
>>>>
>>>> The current approach consists of pulling the remote log segment metadata
>>>> from remote log storage APIs. It worked fine for storages like HDFS. But
>>>> one of the problems of relying on the remote storage to maintain metadata
>>>> is that tiered storage needs it to be strongly consistent, with an impact
>>>> not only on the metadata (e.g. LIST in S3) but also on the segment data
>>>> (e.g. GET after a DELETE in S3). The cost of maintaining metadata in
>>>> remote storage also needs to be factored in. This is true in the case of
>>>> S3, where LIST APIs incur huge costs, as you raised earlier.
>>>> So, it is good to separate the remote storage from the remote log
>>>> metadata store. We refactored the existing RemoteStorageManager and
>>>> introduced RemoteLogMetadataManager. Remote log metadata store should give
>>>> strong consistency semantics but remote log storage can be eventually
>>>> consistent.
>>>> We can have a default implementation for RemoteLogMetadataManager which
>>>> uses an internal topic(as mentioned in one of our earlier emails) as
>>>> storage. But users can always plugin their own RemoteLogMetadataManager
>>>> implementation based on their environment.
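To illustrate the separation described above, here is a minimal sketch. The interface name mirrors the KIP's RemoteLogMetadataManager, but the method signatures are simplified and hypothetical, and an in-memory map stands in for the internal-topic-backed default; users would plug in their own implementation.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the split: remote log storage may be eventually consistent,
// but the metadata manager must give strongly consistent answers.
interface RemoteLogMetadataManagerSketch {
    void putRemoteLogSegmentMetadata(String segmentId, long startOffset);
    Optional<Long> remoteLogSegmentStartOffset(String segmentId);
}

// Stand-in for the default implementation backed by an internal Kafka
// topic; here a concurrent map plays the role of the topic-materialized
// state. Custom implementations could use any strongly consistent store.
class InMemoryRemoteLogMetadataManager implements RemoteLogMetadataManagerSketch {
    private final Map<String, Long> store = new ConcurrentHashMap<>();

    @Override
    public void putRemoteLogSegmentMetadata(String segmentId, long startOffset) {
        store.put(segmentId, startOffset);
    }

    @Override
    public Optional<Long> remoteLogSegmentStartOffset(String segmentId) {
        return Optional.ofNullable(store.get(segmentId));
    }
}
```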
>>>>
>>>> Please go through the updated KIP and let us know your comments. We
>>>> have started refactoring for the changes mentioned in the KIP and there may
>>>> be a few more updates to the APIs.
>>>>
>>>> [1]
>>>> https://docs.google.com/document/d/1qfkBCWL1e7ZWkHU7brxKDBebq4ie9yK20XJnKbgAlew/edit?ts=5e208ec7#
>>>>
>>>> On Fri, Dec 27, 2019 at 5:43 PM Ivan Yurchenko <ivan0yurche...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Jun:
>>>>
>>>> (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
>>>> have 100,000 partitions and want to pull the metadata for each partition
>>>> at the rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.
>>>>
>>>> I want to note here that no reasonably durable storage will be cheap at
>>>> 100k RPS. For example, DynamoDB might give the same ballpark figures.
>>>> If we want to keep the pull-based approach, we can try to reduce this
>>>> number in several ways: doing listings less frequently (as Satish
>>>> mentioned, with the current defaults it's ~3.33k RPS for your example),
>>>> or batching listing operations in some way (depending on the storage; it
>>>> might require a change of RSM's interface).
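As a sanity check on the figures traded back and forth above, the arithmetic works out as follows. The $0.005-per-1,000-requests price is taken from the thread itself and may not match current S3 pricing.

```java
// Back-of-the-envelope check of the LIST cost figures quoted in this
// thread. Pricing is the thread's assumption, not current AWS pricing.
class ListCost {
    static final double COST_PER_REQUEST = 0.005 / 1000.0;

    // Daily cost of listing `partitions` partitions once every
    // `intervalSec` seconds.
    static double dollarsPerDay(int partitions, double intervalSec) {
        double requestsPerSecond = partitions / intervalSec;
        return requestsPerSecond * COST_PER_REQUEST * 86_400;
    }
}
```

At one list per partition per second, 100,000 partitions cost $0.5/sec, i.e. about $43,200/day (Jun's "roughly $40K per day"); at the 30-second default interval that drops to ~3,333 requests/sec and about $1,440/day, matching Ivan's ~3.33k RPS figure.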
>>>>
>>>> There are different ways for doing push based metadata propagation. Some
>>>> object stores may support that already. For example, S3 supports events
>>>> notification
>>>>
>>>> This sounds interesting. However, I see a couple of issues using it:
>>>> 1. As I understand the documentation, notification delivery is not
>>>> guaranteed and it's recommended to periodically do LIST to fill the gaps.
>>>> Which brings us back to the same LIST consistency guarantees issue.
>>>> 2. The same goes for the broker start: to get the current state, we need
>>>> to LIST.
>>>> 3. The dynamic set of multiple consumers (RSMs): AFAIK SQS and SNS
>>>> aren't designed for such a case.
>>>>
>>>> Alexandre:
>>>>
>>>> A.1 As commented on PR 7561, the S3 consistency model [1][2] implies RSM
>>>> cannot rely solely on S3 APIs to guarantee the expected strong
>>>> consistency. The proposed implementation [3] would need to be updated to
>>>> take this into account. Let’s talk more about this.
>>>>
>>>> Thank you for the feedback. I clearly see the need for changing the S3
>>>> implementation to provide stronger consistency guarantees. As I see from
>>>> this thread, there are several possible approaches to this. Let's discuss
>>>> RemoteLogManager's contract and behavior (like pull vs push model)
>>>> further before picking one (or several?) of them.
>>>> I'm going to do some evaluation of DynamoDB for the pull-based approach,
>>>> to see if it's possible to apply it while paying a reasonable bill. Also,
>>>> of the push-based approach with a Kafka topic as the medium.
>>>>
>>>> A.2.3 Atomicity – what does an implementation of RSM need to provide
>>>> with respect to atomicity of the APIs copyLogSegment, cleanupLogUntil and
>>>> deleteTopicPartition? If a partial failure happens in any of those (e.g.
>>>> in the S3 implementation, if one of the multiple uploads fails [4]),
>>>>
>>>> The S3 implementation is going to change, but it's worth clarifying
>>>> anyway.
>>>> The segment log file is uploaded after S3 has acked uploading of all
>>>> other files associated with the segment, and only after this does the
>>>> whole segment file set become visible remotely for operations like
>>>> listRemoteSegments [1].
>>>> In case of upload failure, the files that have been successfully uploaded
>>>> stay as invisible garbage that is collected by cleanupLogUntil (or
>>>> overwritten successfully later).
>>>> And the opposite happens during deletion: log files are deleted first.
>>>> This approach should generally work when we solve consistency issues by
>>>> adding a strongly consistent storage: a segment's uploaded files remain
>>>> invisible garbage until some metadata about them is written.
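The upload ordering described above can be sketched as follows. Storage calls are stubbed with in-memory lists and the method names (uploadSegment, collectGarbage) are invented for illustration; a real RSM would talk to S3 or HDFS.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the described ordering: auxiliary files (indexes etc.) are
// uploaded first, the segment log file last, and the file set only
// becomes listable once every upload has been acked. A failure before
// the log file leaves invisible garbage for a later cleanup pass.
class SegmentUploader {
    private final List<String> remoteStore = new ArrayList<>();      // acked uploads
    private final List<String> visibleSegments = new ArrayList<>();  // listable segments

    // Returns true if the whole segment became remotely visible.
    boolean uploadSegment(String segmentId, List<String> auxFiles,
                          String logFile, boolean failLogUpload) {
        for (String f : auxFiles) {
            remoteStore.add(segmentId + "/" + f);   // aux files go first
        }
        if (failLogUpload) {
            // Crash before the log file: aux files stay as invisible garbage.
            return false;
        }
        remoteStore.add(segmentId + "/" + logFile); // log file goes last
        visibleSegments.add(segmentId);             // only now listable
        return true;
    }

    List<String> listRemoteSegments() { return visibleSegments; }

    // cleanupLogUntil-style GC: drop files of segments that never became
    // visible; returns the number of reclaimed files.
    int collectGarbage() {
        int before = remoteStore.size();
        remoteStore.removeIf(f -> !visibleSegments.contains(f.split("/")[0]));
        return before - remoteStore.size();
    }
}
```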
>>>>
>>>> A.3 Caching – storing locally the segments retrieved from the remote
>>>> storage is excluded as it does not align with the original intent and
>>>> even defeats some of its purposes (save disk space etc.). That said,
>>>> could there be other types of use cases where the pattern of access to
>>>> the remotely stored segments would benefit from local caching (and
>>>> potentially read-ahead)? Consider the use case of a large pool of
>>>> consumers which start a backfill at the same time for one day's worth of
>>>> data from one year ago stored remotely. Caching the segments locally
>>>> would allow uncoupling the load on the remote storage from the load on
>>>> the Kafka cluster. Maybe the RLM could expose a configuration parameter
>>>> to switch that feature on/off?
>>>>
>>>> I tend to agree here, caching remote segments locally and making this
>>>> configurable sounds pretty practical to me. We should implement this,
>>>> maybe not in the first iteration.
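The configurable cache discussed here could be as simple as a size-bounded LRU keyed by segment id behind an enable flag. Everything in this sketch (class name, flag, eviction policy) is hypothetical; it only illustrates the on/off switch Alexandre suggests.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative local cache for fetched remote segments: LRU-evicted,
// bounded by segment count, and disabled entirely when the (assumed)
// config flag is off.
class RemoteSegmentCache {
    private final boolean enabled;
    private final int maxSegments;
    private final LinkedHashMap<String, byte[]> cache;

    RemoteSegmentCache(boolean enabled, int maxSegments) {
        this.enabled = enabled;
        this.maxSegments = maxSegments;
        // accessOrder=true makes iteration order least-recently-used first,
        // so removeEldestEntry evicts the LRU entry when over capacity.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                return size() > RemoteSegmentCache.this.maxSegments;
            }
        };
    }

    byte[] get(String segmentId) {
        return enabled ? cache.get(segmentId) : null;
    }

    void put(String segmentId, byte[] data) {
        if (enabled) cache.put(segmentId, data);
    }

    int size() { return cache.size(); }
}
```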
>>>>
>>>> Br,
>>>> Ivan
>>>>
>>>> [1]
>>>> https://github.com/harshach/kafka/pull/18/files#diff-4d73d01c16caed6f2548fc3063550ef0R152
>>>>
>>>> On Thu, 19 Dec 2019 at 19:49, Alexandre Dupriez <alexandre.dupr...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Jun,
>>>>
>>>> Thank you for the feedback. I am trying to understand how a push-based
>>>> approach would work.
>>>> In order for the metadata to be propagated (under the assumption you
>>>> stated), would you plan to add a new API in Kafka to allow the metadata
>>>> store to send them directly to the brokers?
>>>>
>>>> Thanks,
>>>> Alexandre
>>>>
>>>> Le mer. 18 déc. 2019 à 20:14, Jun Rao <j...@confluent.io> a écrit :
>>>>
>>>> Hi, Satish and Ivan,
>>>>
>>>> There are different ways for doing push based metadata propagation. Some
>>>> object stores may support that already. For example, S3 supports events
>>>> notification (
>>>> https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html).
>>>> Otherwise one could use a separate metadata store that supports
>>>> push-based change propagation. Other people have mentioned using a Kafka
>>>> topic. The best approach may depend on the object store and the
>>>> operational environment (e.g. whether an external metadata store is
>>>> already available).
>>>>
>>>> The above discussion is based on the assumption that we need to cache the
>>>> object metadata locally in every broker. I mentioned earlier that an
>>>> alternative is to just store/retrieve those metadata in an external
>>>> metadata store. That may simplify the implementation in some cases.
>>>>
>>>> Thanks,
>>>>
>>>> Jun
>>>>
>>>> On Thu, Dec 5, 2019 at 7:01 AM Satish Duggana <satish.dugg...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Jun,
>>>> Thanks for your reply.
>>>>
>>>> Currently, `listRemoteSegments` is called at the configured interval
>>>> (not every second; it defaults to 30 secs). Storing remote log metadata
>>>> in a strongly consistent store for the S3 RSM is raised in a PR
>>>> comment [1].
>>>> RLM invokes RSM at regular intervals and RSM can give remote segment
>>>> metadata if it is available. RSM is responsible for maintaining and
>>>> fetching those entries. It should be based on whatever mechanism is
>>>> consistent and efficient with the respective remote storage.
>>>>
>>>> Can you give more details about the push based mechanism from RSM?
>>>>
>>>> [1] https://github.com/apache/kafka/pull/7561#discussion_r344576223
>>>>
>>>> Thanks,
>>>> Satish.
>>>>
>>>> On Thu, Dec 5, 2019 at 4:23 AM Jun Rao <j...@confluent.io> wrote:
>>>>
>>>> Hi, Harsha,
>>>>
>>>> Thanks for the reply.
>>>>
>>>> 40/41. I am curious which block storages you have tested. S3 seems to be
>>>> one of the popular block stores. The concerns that I have with the pull
>>>> based approach are the following.
>>>> (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
>>>> have 100,000 partitions and want to pull the metadata for each partition
>>>> at the rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.
>>>> (b) Semantics: S3 list objects are eventually consistent. So, when you do
>>>> a list object request, there is no guarantee that you can see all
>>>> uploaded objects. This could impact the correctness of subsequent logics.
>>>> (c) Efficiency: Blindly pulling metadata when there is no change adds
>>>> unnecessary overhead in the broker as well as in the block store.
>>>>
>>>> So, have you guys tested S3? If so, could you share your experience in
>>>> terms of cost, semantics and efficiency?
>>>>
>>>> Jun
>>>>
>>>> On Tue, Dec 3, 2019 at 10:11 PM Harsha Chintalapani <ka...@harsha.io>
>>>> wrote:
>>>>
>>>> Hi Jun,
>>>> Thanks for the reply.
>>>>
>>>> On Tue, Nov 26, 2019 at 3:46 PM, Jun Rao <j...@confluent.io> wrote:
>>>>
>>>> Hi, Satish and Ying,
>>>>
>>>> Thanks for the reply.
>>>>
>>>> 40/41. There are two different ways that we can approach this. One is
>>>> what you said. We can have an opinionated way of storing and populating
>>>> the tier metadata that we think is good enough for everyone. I am not
>>>> sure if this is the case based on what's currently proposed in the KIP.
>>>> For example, I am not sure that (1) everyone always needs local
>>>> metadata; (2) the current local storage format is general enough and (3)
>>>> everyone wants to use the pull based approach to propagate the metadata.
>>>> Another approach is to make this pluggable and let the implementor
>>>> implement the best approach for a particular block storage. I haven't
>>>> seen any comments from Slack/AirBnb in the mailing list on this topic.
>>>> It would be great if they can provide feedback directly here.
>>>>
>>>> The current interfaces are designed with the most popular block storages
>>>> available today in mind, and we did 2 implementations with these
>>>> interfaces; both are yielding good results as we go through the testing.
>>>> If there is ever a need for a pull based approach we can definitely
>>>> evolve the interface.
>>>> In the past we did mark interfaces as evolving to make room for unknowns
>>>> in the future.
>>>> If you have any suggestions around the current interfaces, please propose
>>>> them; we are happy to see if we can work them in.
>>>>
>>>> 43. To offer tier storage as a general feature, ideally all existing
>>>> capabilities should still be supported. It's fine if the uber
>>>> implementation doesn't support all capabilities for internal usage.
>>>> However, the framework should be general enough.
>>>>
>>>> We agree on that as a principle. But all of these major features are
>>>> mostly arriving right now, and having a new big feature such as tiered
>>>> storage support all of them would be a big ask. We can document how we
>>>> approach solving these in future iterations.
>>>> Our goal is to make this tiered storage feature work for everyone.
>>>>
>>>> 43.3 This is more than just serving the tier-ed data from block storage.
>>>> With KIP-392, the consumer now can resolve the conflicts with the
>>>> replica based on leader epoch. So, we need to make sure that leader
>>>> epoch can be recovered properly from tier storage.
>>>>
>>>> We are working on testing our approach and we will update the KIP with
>>>> design details.
>>>>
>>>> 43.4 For JBOD, if tier storage stores the tier metadata locally, we need
>>>> to support moving such metadata across disk directories since JBOD
>>>> supports moving data across disks.
>>>>
>>>> The KIP is updated with JBOD details. Having said that, JBOD tooling
>>>> needs to evolve to support production loads. Most users will be
>>>> interested in using tiered storage without JBOD support on day 1.
>>>>
>>>> Thanks,
>>>> Harsha
>>>>
>>>> As for a meeting, we could have a KIP e-meeting on this if needed, but
>>>> it will be open to everyone and will be recorded and shared. Often, the
>>>> details are still resolved through the mailing list.
>>>>
>>>> Jun
>>>>
>>>> On Tue, Nov 19, 2019 at 6:48 PM Ying Zheng <yi...@uber.com.invalid>
>>>> wrote:
>>>>
>>>> Please ignore my previous email.
>>>> I didn't know Apache requires all the discussions to be "open".
>>>>
>>>> On Tue, Nov 19, 2019, 5:40 PM Ying Zheng <yi...@uber.com> wrote:
>>>>
>>>> Hi Jun,
>>>>
>>>> Thank you very much for your feedback!
>>>>
>>>> Can we schedule a meeting in your Palo Alto office in December? I think
>>>> a face to face discussion is much more efficient than emails. Both
>>>> Harsha and I can visit you. Satish may be able to join us remotely.
>>>>
>>>> On Fri, Nov 15, 2019 at 11:04 AM Jun Rao <j...@confluent.io> wrote:
>>>>
>>>> Hi, Satish and Harsha,
>>>>
>>>> The following is a more detailed high level feedback for the KIP.
>>>> Overall, the KIP seems useful. The challenge is how to design it such
>>>> that it’s general enough to support different ways of implementing this
>>>> feature and support existing features.
>>>>
>>>> 40. Local segment metadata storage: The KIP makes the assumption that
>>>> the metadata for the archived log segments are cached locally in every
>>>> broker and provides a specific implementation for the local storage in
>>>> the framework. We probably should discuss this more. For example, some
>>>> tier storage providers may not want to cache the metadata locally and
>>>> just rely upon a remote key/value store if such a store is already
>>>> present. If a local store is used, there could be different ways of
>>>> implementing it (e.g., based on customized local files, an embedded
>>>> local store like RocksDB, etc). An alternative way of designing this is
>>>> to just provide an interface for retrieving the tier segment metadata
>>>> and leave the details of how to get the metadata outside of the
>>>> framework.
>>>>
>>>> 41. RemoteStorageManager interface and the usage of the interface in the
>>>> framework: I am not sure if the interface is general enough. For
>>>> example, it seems that RemoteLogIndexEntry is tied to a specific way of
>>>> storing the metadata in remote storage. The framework uses the
>>>> listRemoteSegments() api in a pull based approach. However, in some
>>>> other implementations, a push based approach may be more preferred. I
>>>> don’t have a concrete proposal yet. But, it would be useful to give this
>>>> area some more thought and see if we can make the interface more
>>>> general.
>>>>
>>>> 42. In the diagram, the RemoteLogManager is side by side with
>>>> LogManager. This KIP only discussed how the fetch request is handled
>>>> between the two layers. However, we should also consider how other
>>>> requests that touch the log can be handled, e.g., list offsets by
>>>> timestamp, delete records, etc. Also, in this model, it's not clear
>>>> which component is responsible for managing the log start offset. It
>>>> seems that the log start offset could be changed by both
>>>> RemoteLogManager and LogManager.
>>>>
>>>> 43. There are quite a few existing features not covered by the KIP. It
>>>> would be useful to discuss each of those.
>>>> 43.1 I won’t say that compacted topics are rarely used and always small.
>>>> For example, KStreams uses compacted topics for storing the states and
>>>> sometimes the size of the topic could be large. While it might be ok to
>>>> not support compacted topics initially, it would be useful to have a
>>>> high level idea on how this might be supported down the road so that we
>>>> don’t have to make incompatible API changes in the future.
>>>> 43.2 We need to discuss how EOS is supported. In particular, how is the
>>>> producer state integrated with the remote storage.
>>>> 43.3 Now that KIP-392 (allow consumers to fetch from closest replica) is
>>>> implemented, we need to discuss how reading from a follower replica is
>>>> supported with tier storage.
>>>> 43.4 We need to discuss how JBOD is supported with tier storage.
>>>>
>>>> Thanks,
>>>>
>>>> Jun
>>>>
>>>> On Fri, Nov 8, 2019 at 12:06 AM Tom Bentley <tbent...@redhat.com> wrote:
>>>>
>>>> Thanks for those insights Ying.
>>>>
>>>> On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng <yi...@uber.com.invalid>
>>>> wrote:
>>>>
>>>> Thanks, I missed that point. However, there's still a point at which the
>>>> consumer fetches start getting served from remote storage (even if that
>>>> point isn't as soon as the local log retention time/size). This
>>>> represents a kind of performance cliff edge and what I'm really
>>>> interested in is how easy it is for a consumer which falls off that
>>>> cliff to catch up and so have its fetches again come from local storage.
>>>> Obviously this can depend on all sorts of factors (like production rate,
>>>> consumption rate), so it's not guaranteed (just like it's not guaranteed
>>>> for Kafka today), but this would represent a new failure mode.
>>>>
>>>> As I have explained in the last mail, it's a very rare case that a
>>>> consumer needs to read remote data. From our experience at Uber, this
>>>> only happens when the consumer service has had an outage for several
>>>> hours.
>>>> There is not a "performance cliff" as you assume. The remote storage is
>>>> even faster than local disks in terms of bandwidth. Reading from remote
>>>> storage is going to have higher latency than local disk. But since the
>>>> consumer is catching up on several hours of data, it's not sensitive to
>>>> the sub-second level latency, and each remote read request will read a
>>>> large amount of data to make the overall performance better than reading
>>>> from local disks.
>>>>
>>>> Another aspect I'd like to understand better is the effect that serving
>>>> fetch requests from remote storage has on the broker's network
>>>> utilization. If we're just trimming the amount of data held locally
>>>> (without increasing the overall local+remote retention), then we're
>>>> effectively trading disk bandwidth for network bandwidth when serving
>>>> fetch requests from remote storage (which I understand to be a good
>>>> thing, since brokers are often/usually disk bound). But if we're
>>>> increasing the overall local+remote retention then it's more likely that
>>>> the network itself becomes the bottleneck.
>>>>
>>>> I appreciate this is all rather hand wavy; I'm just trying to understand
>>>> how this would affect broker performance, so I'd be grateful for any
>>>> insights you can offer.
>>>>
>>>> Network bandwidth is a function of produce speed; it has nothing to do
>>>> with remote retention. As long as the data is shipped to remote storage,
>>>> you can keep the data there for 1 day or 1 year or 100 years, and it
>>>> doesn't consume any network resources.
>>>>