Hi, Satish,

Thanks for the update. Just to clarify: which doc has the latest updates, the
wiki or the Google doc?

Jun

On Thu, May 14, 2020 at 10:38 AM Satish Duggana <satish.dugg...@gmail.com>
wrote:

> Hi Jun,
> Thanks for your comments.  We updated the KIP with more details.
>
> >100. For each of the operations related to tiering, it would be useful to
> provide a description on how it works with the new API. These include
> things like consumer fetch, replica fetch, offsetForTimestamp, retention
> (remote and local) by size, time and logStartOffset, topic deletion, etc.
> This will tell us if the proposed APIs are sufficient.
>
> We addressed most of these APIs in the KIP. We can add more details if
> needed.
>
> >101. For the default implementation based on internal topic, is it meant
> as a proof of concept or for production usage? I assume that it's the
> former. However, if it's the latter, then the KIP needs to describe the
> design in more detail.
>
> It is meant for production usage, as mentioned in an earlier mail. We plan
> to update this section in the next few days.
>
> >102. When tiering a segment, the segment is first written to the object
> store and then its metadata is written to RLMM using the API "void
> putRemoteLogSegmentData()".
> One potential issue with this approach is that if the system fails after
> the first operation, it leaves garbage in the object store that is never
> reclaimed. One way to improve this is to have two separate APIs, something
> like preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>
> That is a good point. We currently handle this in a different way, using
> markers in the segment, but your suggestion is much better.
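>
> For illustration, a rough sketch of what such a two-phase call could look
> like (the names and the metadata type below are placeholders, not the final
> KIP API):
>
>     // Sketch only: split the single putRemoteLogSegmentData() into a
>     // prepare and a commit step, so that a crash between the object-store
>     // upload and the metadata write leaves a PREPARED entry that a cleaner
>     // can reclaim, instead of orphaned, invisible data.
>     public interface TwoPhaseRemoteLogMetadataSketch {
>
>         // Called before the segment is copied to remote storage; records
>         // the intent so the upload can be reclaimed if it never commits.
>         void preparePutRemoteLogSegmentData(SegmentMetadataSketch metadata);
>
>         // Called after the copy fully succeeds; only committed segments
>         // become visible to fetch, retention and deletion paths.
>         void commitPutRemoteLogSegmentData(SegmentMetadataSketch metadata);
>
>         // Placeholder standing in for the KIP's remote segment metadata type.
>         class SegmentMetadataSketch { }
>     }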
>
> >103. It seems that the transactional support and the ability to read
> from follower are missing.
>
> The KIP is updated with transactional support, follower fetch semantics, and
> reading from a follower.
>
> >104. It would be useful to provide a testing plan for this KIP.
>
> We added a few tests by introducing a test util for tiered storage in the
> PR. We will provide the testing plan in the next few days.
>
> Thanks,
> Satish.
>
>
> On Wed, Feb 26, 2020 at 9:43 PM Harsha Chintalapani <ka...@harsha.io>
> wrote:
>
>>
>>
>>
>>
>> On Tue, Feb 25, 2020 at 12:46 PM, Jun Rao <j...@confluent.io> wrote:
>>
>>> Hi, Satish,
>>>
>>> Thanks for the updated doc. The new API seems to be an improvement
>>> overall. A few more comments below.
>>>
>>> 100. For each of the operations related to tiering, it would be useful
>>> to provide a description on how it works with the new API. These include
>>> things like consumer fetch, replica fetch, offsetForTimestamp, retention
>>> (remote and local) by size, time and logStartOffset, topic deletion,
>>> etc. This will tell us if the proposed APIs are sufficient.
>>>
>>
>> Thanks for the feedback Jun. We will add more details around this.
>>
>>
>>> 101. For the default implementation based on internal topic, is it meant
>>> as a proof of concept or for production usage? I assume that it's the
>>> former. However, if it's the latter, then the KIP needs to describe the
>>> design in more detail.
>>>
>>
>> Yes, it is meant for production use. Ideally it would be good to merge this
>> in as the default implementation for the metadata service. We can add more
>> details around design and testing.
>>
>>> 102. When tiering a segment, the segment is first written to the object
>>> store and then its metadata is written to RLMM using the API "void
>>> putRemoteLogSegmentData()".
>>> One potential issue with this approach is that if the system fails after
>>> the first operation, it leaves garbage in the object store that is never
>>> reclaimed. One way to improve this is to have two separate APIs, something
>>> like preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>>>
>>> 103. It seems that the transactional support and the ability to read
>>> from follower are missing.
>>>
>>> 104. It would be useful to provide a testing plan for this KIP.
>>>
>>
>> We are working on adding more details around transactional support and
>> coming up with a test plan, including system tests and integration tests.
>>
>>> Thanks,
>>>
>>> Jun
>>>
>>> On Mon, Feb 24, 2020 at 8:10 AM Satish Duggana <satish.dugg...@gmail.com>
>>> wrote:
>>>
>>> Hi Jun,
>>> Please look at the earlier reply and let us know your comments.
>>>
>>> Thanks,
>>> Satish.
>>>
>>> On Wed, Feb 12, 2020 at 4:06 PM Satish Duggana <satish.dugg...@gmail.com>
>>> wrote:
>>>
>>> Hi Jun,
>>> Thanks for your comments on the separation of remote log metadata storage
>>> and remote log storage.
>>> We have had a few discussions since early Jan on how to support eventually
>>> consistent stores like S3 by decoupling remote log segment metadata from
>>> remote log storage. The details are written in the doc here [1]. Below is a
>>> brief summary of the discussion from that doc.
>>>
>>> The current approach consists of pulling the remote log segment metadata
>>> from remote log storage APIs. It worked fine for storages like HDFS. But
>>> one of the problems of relying on the remote storage to maintain metadata
>>> is that it requires the remote storage to be strongly consistent, with an
>>> impact not only on the metadata (e.g. LIST in S3) but also on the segment
>>> data (e.g. GET after a DELETE in S3). The cost of maintaining metadata in
>>> remote storage also needs to be factored in; in the case of S3, LIST APIs
>>> incur huge costs, as you raised earlier.
>>> So, it is good to separate the remote storage from the remote log metadata
>>> store. We refactored the existing RemoteStorageManager and introduced
>>> RemoteLogMetadataManager. The remote log metadata store should give strong
>>> consistency semantics, but the remote log storage can be eventually
>>> consistent.
>>> We can have a default implementation for RemoteLogMetadataManager which
>>> uses an internal topic (as mentioned in one of our earlier emails) as
>>> storage, but users can always plug in their own RemoteLogMetadataManager
>>> implementation based on their environment.
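>>>
>>> As a rough illustration of that split (the method names and types below
>>> are simplified placeholders, not the exact KIP signatures):
>>>
>>>     // Sketch only: segment bytes go to an eventually consistent object
>>>     // store, while segment metadata goes to a separately pluggable,
>>>     // strongly consistent store (default: an internal Kafka topic).
>>>     interface RemoteStorageManagerSketch {
>>>         void copyLogSegment(String segmentId, java.nio.file.Path segmentFile);
>>>         java.io.InputStream fetchLogSegment(String segmentId, long startPosition);
>>>         void deleteLogSegment(String segmentId);
>>>     }
>>>
>>>     interface RemoteLogMetadataManagerSketch {
>>>         // Strongly consistent record of which segments exist remotely.
>>>         void putRemoteLogSegmentMetadata(String segmentId, long startOffset, long endOffset);
>>>         // Look up which remote segment contains the given offset, if any.
>>>         String remoteLogSegmentIdForOffset(String topicPartition, long offset);
>>>     }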
>>>
>>> Please go through the updated KIP and let us know your comments. We have
>>> started refactoring for the changes mentioned in the KIP and there may be a
>>> few more updates to the APIs.
>>>
>>> [1]
>>> https://docs.google.com/document/d/1qfkBCWL1e7ZWkHU7brxKDBebq4ie9yK20XJnKbgAlew/edit?ts=5e208ec7#
>>>
>>> On Fri, Dec 27, 2019 at 5:43 PM Ivan Yurchenko <ivan0yurche...@gmail.com>
>>> wrote:
>>>
>>> Hi all,
>>>
>>> Jun:
>>>
>>> (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
>>> have 100,000 partitions and want to pull the metadata for each partition
>>> at the rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.
>>>
>>> I want to note here that no reasonably durable storage will be cheap at
>>> 100k RPS. For example, DynamoDB might give the same ballpark figures.
>>> If we want to keep the pull-based approach, we can try to reduce this
>>> number in several ways: doing listings less frequently (as Satish
>>> mentioned, with the current defaults it's ~3.33k RPS for your example), or
>>> batching listing operations in some way (depending on the storage; it might
>>> require a change to RSM's interface).
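>>>
>>> (For reference, the rough arithmetic behind these figures, assuming the
>>> $0.005 per 1,000 LIST requests price: 100,000 partitions x 1 LIST/sec =
>>> 100,000 req/sec, i.e. 100 x $0.005 = $0.50/sec, or about $43,000 per day;
>>> with the default 30-second interval it is 100,000 / 30 ≈ 3,333 req/sec,
>>> or roughly $1,400 per day.)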
>>>
>>> There are different ways for doing push based metadata propagation. Some
>>> object stores may support that already. For example, S3 supports event
>>> notifications.
>>> This sounds interesting. However, I see a couple of issues using it:
>>> 1. As I understand the documentation, notification delivery is not
>>> guaranteed and it's recommended to periodically do LIST to fill the gaps,
>>> which brings us back to the same LIST consistency guarantees issue.
>>> 2. The same goes for the broker start: to get the current state, we need
>>> to LIST.
>>> 3. The dynamic set of multiple consumers (RSMs): AFAIK SQS and SNS aren't
>>> designed for such a case.
>>>
>>> Alexandre:
>>>
>>> A.1 As commented on PR 7561, the S3 consistency model [1][2] implies RSM
>>> cannot rely solely on S3 APIs to guarantee the expected strong consistency.
>>> The proposed implementation [3] would need to be updated to take this into
>>> account. Let's talk more about this.
>>>
>>> Thank you for the feedback. I clearly see the need for changing the S3
>>> implementation to provide stronger consistency guarantees. As I see from
>>> this thread, there are several possible approaches to this. Let's discuss
>>> RemoteLogManager's contract and behavior (like pull vs push model) further
>>> before picking one (or several?) of them.
>>> I'm going to do some evaluation of DynamoDB for the pull-based approach, to
>>> see if it's possible to apply it while paying a reasonable bill. I will
>>> also evaluate the push-based approach with a Kafka topic as the medium.
>>>
>>> A.2.3 Atomicity – what does an implementation of RSM need to provide with
>>> respect to atomicity of the APIs copyLogSegment, cleanupLogUntil and
>>> deleteTopicPartition? If a partial failure happens in any of those (e.g.
>>> in the S3 implementation, if one of the multiple uploads fails [4]),
>>>
>>> The S3 implementation is going to change, but it's worth clarifying anyway.
>>> The segment log file is uploaded only after S3 has acked the upload of all
>>> other files associated with the segment, and only after this does the whole
>>> segment file set become visible remotely for operations like
>>> listRemoteSegments [1].
>>> In case of an upload failure, the files that have been successfully
>>> uploaded stay as invisible garbage that is collected by cleanupLogUntil (or
>>> overwritten successfully later).
>>> And the opposite happens during deletion: log files are deleted first.
>>> This approach should generally work when we solve consistency issues by
>>> adding a strongly consistent storage: a segment's uploaded files remain
>>> invisible garbage until some metadata about them is written.
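>>>
>>> A rough sketch of that ordering (the type and method names here are
>>> illustrative only, not the actual RSM implementation):
>>>
>>>     // Sketch only: upload the log file last so the segment becomes
>>>     // visible to listings only once it is complete, and delete it first
>>>     // so a partially deleted segment is already invisible; leftovers are
>>>     // plain garbage for cleanupLogUntil to collect.
>>>     class SegmentUploadOrderSketch {
>>>
>>>         interface ObjectStore {          // stands in for the S3 client
>>>             void put(String key, byte[] data);
>>>             void delete(String key);
>>>         }
>>>
>>>         void copySegment(ObjectStore store, String baseKey, byte[] offsetIndex,
>>>                          byte[] timeIndex, byte[] log) {
>>>             store.put(baseKey + ".index", offsetIndex);   // auxiliary files first
>>>             store.put(baseKey + ".timeindex", timeIndex);
>>>             store.put(baseKey + ".log", log);             // log last => visible
>>>         }
>>>
>>>         void deleteSegment(ObjectStore store, String baseKey) {
>>>             store.delete(baseKey + ".log");               // log first => invisible
>>>             store.delete(baseKey + ".index");
>>>             store.delete(baseKey + ".timeindex");
>>>         }
>>>     }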
>>>
>>> A.3 Caching – storing locally the segments retrieved from the remote
>>> storage is excluded, as it does not align with the original intent and even
>>> defeats some of its purposes (saving disk space etc.). That said, could
>>> there be other types of use cases where the pattern of access to the
>>> remotely stored segments would benefit from local caching (and potentially
>>> read-ahead)? Consider the use case of a large pool of consumers which start
>>> a backfill at the same time for one day's worth of data from one year ago
>>> stored remotely. Caching the segments locally would allow uncoupling the
>>> load on the remote storage from the load on the Kafka cluster. Maybe the
>>> RLM could expose a configuration parameter to switch that feature on/off?
>>>
>>> I tend to agree here: caching remote segments locally and making this
>>> configurable sounds pretty practical to me. We should implement this, maybe
>>> not in the first iteration.
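>>>
>>> Purely as an illustration, such a switch could be a broker-level RLM config
>>> along the following lines (these property names are made up here and are
>>> not part of the KIP):
>>>
>>>     # hypothetical example only, not an actual KIP-405 property
>>>     remote.log.reader.cache.enable=true
>>>     remote.log.reader.cache.max.bytes=10737418240   # e.g. 10 GiB local cache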
>>>
>>> Br,
>>> Ivan
>>>
>>> [1]
>>> https://github.com/harshach/kafka/pull/18/files#diff-4d73d01c16caed6f2548fc3063550ef0R152
>>>
>>> On Thu, 19 Dec 2019 at 19:49, Alexandre Dupriez <alexandre.dupr...@gmail.com>
>>> wrote:
>>>
>>> Hi Jun,
>>>
>>> Thank you for the feedback. I am trying to understand how a push-based
>>> approach would work.
>>> In order for the metadata to be propagated (under the assumption you
>>> stated), would you plan to add a new API in Kafka to allow the metadata
>>> store to send it directly to the brokers?
>>>
>>> Thanks,
>>> Alexandre
>>>
>>> On Wed, Dec 18, 2019 at 20:14, Jun Rao <j...@confluent.io> wrote:
>>>
>>> Hi, Satish and Ivan,
>>>
>>> There are different ways for doing push based metadata propagation. Some
>>> object stores may support that already. For example, S3 supports event
>>> notifications (
>>> https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html).
>>> Otherwise one could use a separate metadata store that supports push-based
>>> change propagation. Other people have mentioned using a Kafka topic. The
>>> best approach may depend on the object store and the operational
>>> environment (e.g. whether an external metadata store is already available).
>>>
>>> The above discussion is based on the assumption that we need to cache the
>>> object metadata locally in every broker. I mentioned earlier that an
>>> alternative is to just store/retrieve that metadata in an external metadata
>>> store. That may simplify the implementation in some cases.
>>>
>>> Thanks,
>>>
>>> Jun
>>>
>>> On Thu, Dec 5, 2019 at 7:01 AM Satish Duggana <satish.dugg...@gmail.com>
>>> wrote:
>>>
>>> Hi Jun,
>>> Thanks for your reply.
>>>
>>> Currently, `listRemoteSegments` is called at the configured interval (not
>>> every second; it defaults to 30 secs). Storing remote log metadata in a
>>> strongly consistent store for the S3 RSM is raised in a PR comment [1].
>>> RLM invokes RSM at regular intervals and RSM can give remote segment
>>> metadata if it is available. RSM is responsible for maintaining and
>>> fetching those entries. It should be based on whatever mechanism is
>>> consistent and efficient with the respective remote storage.
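>>>
>>> A minimal sketch of that pull loop, with made-up names, just to make the
>>> flow concrete (this is not the actual RLM code):
>>>
>>>     import java.util.List;
>>>     import java.util.concurrent.Executors;
>>>     import java.util.concurrent.ScheduledExecutorService;
>>>     import java.util.concurrent.TimeUnit;
>>>     import java.util.function.Consumer;
>>>     import java.util.function.Supplier;
>>>
>>>     // Sketch only: RLM polls RSM's remote segment listing on a fixed
>>>     // schedule (e.g. the 30s default) and refreshes its local view.
>>>     class RemoteSegmentPollerSketch {
>>>         static ScheduledExecutorService start(Supplier<List<String>> listRemoteSegments,
>>>                                               Consumer<List<String>> refreshLocalView,
>>>                                               long intervalSeconds) {
>>>             ScheduledExecutorService executor =
>>>                 Executors.newSingleThreadScheduledExecutor();
>>>             executor.scheduleAtFixedRate(
>>>                 () -> refreshLocalView.accept(listRemoteSegments.get()),
>>>                 0, intervalSeconds, TimeUnit.SECONDS);
>>>             return executor;
>>>         }
>>>     }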
>>>
>>> Can you give more details about the push-based mechanism from RSM?
>>>
>>> [1]
>>> https://github.com/apache/kafka/pull/7561#discussion_r344576223
>>>
>>> Thanks,
>>> Satish.
>>>
>>> On Thu, Dec 5, 2019 at 4:23 AM Jun Rao <j...@confluent.io> wrote:
>>>
>>> Hi, Harsha,
>>>
>>> Thanks for the reply.
>>>
>>> 40/41. I am curious which block storages you have tested. S3 seems to be
>>> one of the popular block stores. The concerns that I have with the pull
>>> based approach are the following.
>>> (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you
>>> have 100,000 partitions and want to pull the metadata for each partition
>>> at the rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.
>>> (b) Semantics: S3 list objects are eventually consistent. So, when you do a
>>> list object request, there is no guarantee that you can see all uploaded
>>> objects. This could impact the correctness of subsequent logic.
>>> (c) Efficiency: Blindly pulling metadata when there is no change adds
>>> unnecessary overhead in the broker as well as in the block store.
>>>
>>> So, have you guys tested S3? If so, could you share your experience in
>>> terms of cost, semantics and efficiency?
>>>
>>> Jun
>>>
>>> On Tue, Dec 3, 2019 at 10:11 PM Harsha Chintalapani <ka...@harsha.io>
>>> wrote:
>>>
>>> Hi Jun,
>>> Thanks for the reply.
>>>
>>> On Tue, Nov 26, 2019 at 3:46 PM, Jun Rao <j...@confluent.io> wrote:
>>>
>>> Hi, Satish and Ying,
>>>
>>> Thanks for the reply.
>>>
>>> 40/41. There are two different ways that we can approach this. One is what
>>> you said. We can have an opinionated way of storing and populating the tier
>>> metadata that we think is good enough for everyone. I am not sure if this
>>> is the case based on what's currently proposed in the KIP. For example, I
>>> am not sure that (1) everyone always needs local metadata; (2) the current
>>> local storage format is general enough and (3) everyone wants to use the
>>> pull based approach to propagate the metadata. Another approach is to make
>>> this pluggable and let the implementor implement the best approach for a
>>> particular block storage. I haven't seen any comments from Slack/AirBnb in
>>> the mailing list on this topic. It would be great if they can provide
>>> feedback directly here.
>>>
>>> The current interfaces are designed with the most popular block storages
>>> available today in mind, and we did 2 implementations with these
>>> interfaces; both are yielding good results as we go through the testing of
>>> them.
>>> If there is ever a need for a pull based approach we can definitely evolve
>>> the interface.
>>> In the past we did mark interfaces as evolving to make room for unknowns
>>> in the future.
>>> If you have any suggestions around the current interfaces, please propose
>>> them; we are happy to see if we can work them in.
>>>
>>> 43. To offer tier storage as a general feature, ideally all existing
>>> capabilities should still be supported. It's fine if the Uber
>>> implementation doesn't support all capabilities for internal usage.
>>> However, the framework should be general enough.
>>>
>>> We agree on that as a principle. But most of these major features are only
>>> coming in right now, and asking a new big feature such as tiered storage to
>>> support all of the new features as well would be a big ask. We can document
>>> how we approach solving these in future iterations.
>>> Our goal is to make this tiered storage feature work for everyone.
>>>
>>> 43.3 This is more than just serving the tiered data from block storage.
>>> With KIP-392, the consumer can now resolve the conflicts with the replica
>>> based on leader epoch. So, we need to make sure that the leader epoch can
>>> be recovered properly from tier storage.
>>>
>>> We are working on testing our approach and we will update the KIP with
>>> design details.
>>>
>>> 43.4 For JBOD, if tier storage stores the tier metadata locally, we need to
>>> support moving such metadata across disk directories since JBOD supports
>>> moving data across disks.
>>>
>>> The KIP is updated with JBOD details. Having said that, JBOD tooling needs
>>> to evolve to support production loads. Most of the users will be interested
>>> in using tiered storage without JBOD support on day 1.
>>>
>>> Thanks,
>>> Harsha
>>>
>>> As for the meeting, we could have a KIP e-meeting on this if needed, but it
>>> will be open to everyone and will be recorded and shared. Often, the
>>> details are still resolved through the mailing list.
>>>
>>> Jun
>>>
>>> On Tue, Nov 19, 2019 at 6:48 PM Ying Zheng <yi...@uber.com.invalid>
>>> wrote:
>>>
>>> Please ignore my previous email.
>>> I didn't know Apache requires all the discussions to be "open".
>>>
>>> On Tue, Nov 19, 2019, 5:40 PM Ying Zheng <yi...@uber.com> wrote:
>>>
>>> Hi Jun,
>>>
>>> Thank you very much for your feedback!
>>>
>>> Can we schedule a meeting in your Palo Alto office in December? I think a
>>> face to face discussion is much more efficient than emails. Both Harsha and
>>> I can visit you. Satish may be able to join us remotely.
>>>
>>> On Fri, Nov 15, 2019 at 11:04 AM Jun Rao <j...@confluent.io> wrote:
>>>
>>> Hi, Satish and Harsha,
>>>
>>> The following is more detailed high level feedback for the KIP. Overall,
>>> the KIP seems useful. The challenge is how to design it such that it's
>>> general enough to support different ways of implementing this feature and
>>> to support existing features.
>>>
>>> 40. Local segment metadata storage: The KIP makes the assumption that the
>>> metadata for the archived log segments is cached locally in every broker
>>> and provides a specific implementation for the local storage in the
>>> framework. We probably should discuss this more. For example, some tier
>>> storage providers may not want to cache the metadata locally and just rely
>>> upon a remote key/value store if such a store is already present. If a
>>> local store is used, there could be different ways of implementing it
>>> (e.g., based on customized local files, an embedded local store like
>>> RocksDB, etc.). An alternative design is to just provide an interface for
>>> retrieving the tier segment metadata and leave the details of how to get
>>> the metadata outside of the framework.
>>>
>>> 41. RemoteStorageManager interface and the usage of the interface in the
>>> framework: I am not sure if the interface is general enough. For example,
>>> it seems that RemoteLogIndexEntry is tied to a specific way of storing the
>>> metadata in remote storage. The framework uses the listRemoteSegments() API
>>> in a pull based approach. However, in some other implementations, a push
>>> based approach may be preferred. I don't have a concrete proposal yet. But
>>> it would be useful to give this area some more thought and see if we can
>>> make the interface more general.
>>>
>>> 42. In the diagram, the RemoteLogManager is side by side with LogManager.
>>> This KIP only discusses how the fetch request is handled between the two
>>> layers. However, we should also consider how other requests that touch the
>>> log can be handled, e.g., list offsets by timestamp, delete records, etc.
>>> Also, in this model, it's not clear which component is responsible for
>>> managing the log start offset. It seems that the log start offset could be
>>> changed by both RemoteLogManager and LogManager.
>>>
>>> 43. There are quite a few existing features not covered by the KIP. It
>>> would be useful to discuss each of those.
>>> 43.1 I won't say that compacted topics are rarely used and always small.
>>> For example, KStreams uses compacted topics for storing the states and
>>> sometimes the size of the topic could be large. While it might be ok to not
>>> support compacted topics initially, it would be useful to have a high level
>>> idea on how this might be supported down the road so that we don't have to
>>> make incompatible API changes in the future.
>>> 43.2 We need to discuss how EOS is supported. In particular, how is the
>>> producer state integrated with the remote storage?
>>> 43.3 Now that KIP-392 (allow consumers to fetch from the closest replica)
>>> is implemented, we need to discuss how reading from a follower replica is
>>> supported with tier storage.
>>> 43.4 We need to discuss how JBOD is supported with tier storage.
>>>
>>> Thanks,
>>>
>>> Jun
>>>
>>> On Fri, Nov 8, 2019 at 12:06 AM Tom Bentley <tbent...@redhat.com> wrote:
>>>
>>> Thanks for those insights Ying.
>>>
>>> On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng <yi...@uber.com.invalid> wrote:
>>>
>>> Thanks, I missed that point. However, there's still a point at which the
>>> consumer fetches start getting served from remote storage (even if that
>>> point isn't as soon as the local log retention time/size). This represents
>>> a kind of performance cliff edge and what I'm really interested in is how
>>> easy it is for a consumer which falls off that cliff to catch up and so its
>>> fetches again come from local storage. Obviously this can depend on all
>>> sorts of factors (like production rate, consumption rate), so it's not
>>> guaranteed (just like it's not guaranteed for Kafka today), but this would
>>> represent a new failure mode.
>>>
>>> As I have explained in the last mail, it's a very rare case that a consumer
>>> needs to read remote data. In our experience at Uber, this only happens
>>> when the consumer service has had an outage for several hours.
>>>
>>> There is not a "performance cliff" as you assume. The remote storage is
>>> even faster than local disks in terms of bandwidth. Reading from remote
>>> storage is going to have higher latency than local disk. But since the
>>> consumer is catching up on several hours of data, it's not sensitive to the
>>> sub-second level latency, and each remote read request will read a large
>>> amount of data to make the overall performance better than reading from
>>> local disks.
>>>
>>> Another aspect I'd like to understand better is the effect that serving
>>> fetch requests from remote storage has on the broker's network utilization.
>>> If we're just trimming the amount of data held locally (without increasing
>>> the overall local+remote retention), then we're effectively trading disk
>>> bandwidth for network bandwidth when serving fetch requests from remote
>>> storage (which I understand to be a good thing, since brokers are
>>> often/usually disk bound). But if we're increasing the overall local+remote
>>> retention then it's more likely that the network itself becomes the
>>> bottleneck.
>>>
>>> I appreciate this is all rather hand wavy; I'm just trying to understand
>>> how this would affect broker performance, so I'd be grateful for any
>>> insights you can offer.
>>>
>>> Network bandwidth is a function of produce speed; it has nothing to do with
>>> remote retention. As long as the data is shipped to remote storage, you can
>>> keep the data there for 1 day or 1 year or 100 years; it doesn't consume
>>> any network resources.
>>>
>>>
>>
