Hi Jun,
Please go through the wiki, which has the latest updates. The Google doc is updated frequently to stay in sync with the wiki.
Thanks,
Satish.

On Tue, May 19, 2020 at 12:30 AM Jun Rao <j...@confluent.io> wrote:

> Hi, Satish,
>
> Thanks for the update. Just to clarify: which doc has the latest updates, the wiki or the Google doc?
>
> Jun
>
> On Thu, May 14, 2020 at 10:38 AM Satish Duggana <satish.dugg...@gmail.com> wrote:
>
>> Hi Jun,
>> Thanks for your comments. We updated the KIP with more details.
>>
>> >100. For each of the operations related to tiering, it would be useful to provide a description on how it works with the new API. These include things like consumer fetch, replica fetch, offsetForTimestamp, retention (remote and local) by size, time and logStartOffset, topic deletion, etc. This will tell us if the proposed APIs are sufficient.
>>
>> We addressed most of these APIs in the KIP. We can add more details if needed.
>>
>> >101. For the default implementation based on an internal topic, is it meant as a proof of concept or for production usage? I assume that it's the former. However, if it's the latter, then the KIP needs to describe the design in more detail.
>>
>> It is meant for production usage, as mentioned in an earlier mail. We plan to update this section in the next few days.
>>
>> >102. When tiering a segment, the segment is first written to the object store and then its metadata is written to RLMM using the API "void putRemoteLogSegmentData()". One potential issue with this approach is that if the system fails after the first operation, it leaves garbage in the object store that's never reclaimed. One way to improve this is to have two separate APIs, something like preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>>
>> That is a good point. We currently have a different way, using markers in the segment, but your suggestion is much better.
>>
>> >103. It seems that the transactional support and the ability to read from a follower are missing.
>>
>> The KIP is updated with transactional support, follower fetch semantics, and reading from a follower.
>>
>> >104. It would be useful to provide a testing plan for this KIP.
>>
>> We added a few tests by introducing a test util for tiered storage in the PR. We will provide the testing plan in the next few days.
>>
>> Thanks,
>> Satish.
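For concreteness, a minimal sketch of the two-phase RLMM write Jun suggests in 102 above; the RemoteLogSegmentMetadata/RemoteLogSegmentId types are placeholders, not the KIP's actual classes:

    // Hypothetical two-phase replacement for putRemoteLogSegmentData().
    public interface RemoteLogMetadataManager {
        // Phase 1: record the intent to copy, before any bytes reach the
        // object store. A dangling prepared entry identifies a partial
        // upload that a cleaner may later safely delete.
        void preparePutRemoteLogSegmentData(RemoteLogSegmentMetadata metadata);

        // Phase 2: called only after the object store acks the upload.
        // Only committed segments become visible to readers.
        void commitPutRemoteLogSegmentData(RemoteLogSegmentId id);
    }

With this split, the copy path becomes prepare -> upload -> commit, and any entry that stays in the prepared state past some timeout marks garbage in the object store that can be reclaimed.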
>> On Wed, Feb 26, 2020 at 9:43 PM Harsha Chintalapani <ka...@harsha.io> wrote:
>>
>>> On Tue, Feb 25, 2020 at 12:46 PM, Jun Rao <j...@confluent.io> wrote:
>>>
>>>> Hi, Satish,
>>>>
>>>> Thanks for the updated doc. The new API seems to be an improvement overall. A few more comments below.
>>>>
>>>> 100. For each of the operations related to tiering, it would be useful to provide a description on how it works with the new API. These include things like consumer fetch, replica fetch, offsetForTimestamp, retention (remote and local) by size, time and logStartOffset, topic deletion, etc. This will tell us if the proposed APIs are sufficient.
>>>
>>> Thanks for the feedback, Jun. We will add more details around this.
>>>
>>>> 101. For the default implementation based on an internal topic, is it meant as a proof of concept or for production usage? I assume that it's the former. However, if it's the latter, then the KIP needs to describe the design in more detail.
>>>
>>> Yes, it is meant for production use. Ideally it would be good to merge this in as the default implementation for the metadata service. We can add more details around design and testing.
>>>
>>>> 102. When tiering a segment, the segment is first written to the object store and then its metadata is written to RLMM using the API "void putRemoteLogSegmentData()". One potential issue with this approach is that if the system fails after the first operation, it leaves garbage in the object store that's never reclaimed. One way to improve this is to have two separate APIs, something like preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().
>>>>
>>>> 103. It seems that the transactional support and the ability to read from a follower are missing.
>>>>
>>>> 104. It would be useful to provide a testing plan for this KIP.
>>>
>>> We are working on adding more details around transactional support and coming up with a test plan, including system tests and integration tests.
>>>
>>>> Thanks,
>>>>
>>>> Jun
>>>>
>>>> On Mon, Feb 24, 2020 at 8:10 AM Satish Duggana <satish.dugg...@gmail.com> wrote:
>>>>
>>>> Hi Jun,
>>>> Please look at the earlier reply and let us know your comments.
>>>>
>>>> Thanks,
>>>> Satish.
>>>>
>>>> On Wed, Feb 12, 2020 at 4:06 PM Satish Duggana <satish.dugg...@gmail.com> wrote:
>>>>
>>>> Hi Jun,
>>>> Thanks for your comments on the separation of remote log metadata storage and remote log storage.
>>>> We have had a few discussions since early Jan on how to support eventually consistent stores like S3 by decoupling remote log segment metadata from remote log storage. It is written up in detail in the doc here [1]. Below is a brief summary of the discussion from that doc.
>>>>
>>>> The current approach consists of pulling the remote log segment metadata from the remote log storage APIs. It worked fine for storages like HDFS. But one of the problems of relying on the remote storage to maintain metadata is that tiered storage needs it to be strongly consistent, with an impact not only on the metadata (e.g. LIST in S3) but also on the segment data (e.g. GET after a DELETE in S3). The cost of maintaining metadata in remote storage also needs to be factored in; this is true in the case of S3, where LIST APIs incur huge costs, as you raised earlier.
>>>> So, it is good to separate the remote storage from the remote log metadata store. We refactored the existing RemoteStorageManager and introduced RemoteLogMetadataManager. The remote log metadata store should give strong consistency semantics, but the remote log storage can be eventually consistent.
>>>> We can have a default implementation for RemoteLogMetadataManager which uses an internal topic (as mentioned in one of our earlier emails) as storage. But users can always plug in their own RemoteLogMetadataManager implementation based on their environment.
>>>>
>>>> Please go through the updated KIP and let us know your comments. We have started refactoring for the changes mentioned in the KIP, and there may be a few more updates to the APIs.
>>>>
>>>> [1] https://docs.google.com/document/d/1qfkBCWL1e7ZWkHU7brxKDBebq4ie9yK20XJnKbgAlew/edit?ts=5e208ec7#
>>>>
>>>> On Fri, Dec 27, 2019 at 5:43 PM Ivan Yurchenko <ivan0yurche...@gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Jun:
>>>>
>>>> > (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you have 100,000 partitions and want to pull the metadata for each partition at the rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.
>>>>
>>>> I want to note here that no reasonably durable storage will be cheap at 100k RPS. For example, DynamoDB might give the same ballpark figures.
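For reference, the arithmetic behind that estimate:

    100,000 partitions x 1 LIST request/sec      = 100,000 requests/sec
    100,000 req/sec x ($0.005 / 1,000 requests)  = $0.50/sec
    $0.50/sec x 86,400 sec/day                   = ~$43,200/day

which matches the "roughly $40K per day" figure in the thread.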
>>>> If we want to keep the pull-based approach, we can try to reduce this number in several ways: doing listings less frequently (as Satish mentioned, with the current defaults it's ~3.33k RPS for your example), or batching listing operations in some way (depending on the storage; it might require a change to RSM's interface).
>>>>
>>>> > There are different ways for doing push based metadata propagation. Some object stores may support that already. For example, S3 supports events notification
>>>>
>>>> This sounds interesting. However, I see a couple of issues using it:
>>>> 1. As I understand the documentation, notification delivery is not guaranteed, and it's recommended to periodically do a LIST to fill the gaps. This brings us back to the same LIST consistency guarantees issue.
>>>> 2. The same goes for broker start: to get the current state, we need to LIST.
>>>> 3. The dynamic set of multiple consumers (RSMs): AFAIK, SQS and SNS aren't designed for such a case.
>>>>
>>>> Alexandre:
>>>>
>>>> > A.1 As commented on PR 7561, the S3 consistency model [1][2] implies RSM cannot rely solely on S3 APIs to guarantee the expected strong consistency. The proposed implementation [3] would need to be updated to take this into account. Let's talk more about this.
>>>>
>>>> Thank you for the feedback. I clearly see the need for changing the S3 implementation to provide stronger consistency guarantees. As seen from this thread, there are several possible approaches to this. Let's discuss RemoteLogManager's contract and behavior (like pull vs push model) further before picking one (or several?) of them.
>>>> I'm going to do some evaluation of DynamoDB for the pull-based approach, to see whether it can be applied while paying a reasonable bill. Also, of the push-based approach with a Kafka topic as the medium.
>>>>
>>>> > A.2.3 Atomicity – what does an implementation of RSM need to provide with respect to atomicity of the APIs copyLogSegment, cleanupLogUntil and deleteTopicPartition? If a partial failure happens in any of those (e.g. in the S3 implementation, if one of the multiple uploads fails [4]),
>>>>
>>>> The S3 implementation is going to change, but it's worth clarifying anyway.
>>>> The segment log file is uploaded only after S3 has acked the upload of all other files associated with the segment, and only after this does the whole segment file set become visible remotely for operations like listRemoteSegments [1].
>>>> In case of an upload failure, the files that have been successfully uploaded stay as invisible garbage that is collected by cleanupLogUntil (or overwritten successfully later).
>>>> And the opposite happens during deletion: log files are deleted first.
>>>> This approach should generally work when we solve the consistency issues by adding a strongly consistent storage: a segment's uploaded files remain invisible garbage until some metadata about them is written.
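A minimal sketch of the upload ordering Ivan describes; the ObjectStore client and the key suffixes are assumptions, not the actual PR code:

    import java.io.IOException;

    // Hypothetical minimal object-store client (the real code would use an SDK).
    interface ObjectStore {
        void put(String key, byte[] bytes) throws IOException;
    }

    // The segment log file is the last write, so it acts as the commit point:
    // a failure part-way leaves only files that listRemoteSegments never reports.
    static void copySegment(ObjectStore store, String baseKey,
                            byte[] offsetIndex, byte[] timeIndex,
                            byte[] log) throws IOException {
        store.put(baseKey + ".index", offsetIndex);   // auxiliary files first
        store.put(baseKey + ".timeindex", timeIndex);
        store.put(baseKey + ".log", log);             // last write = visible
        // A crash before the last put leaves only invisible garbage, later
        // reclaimed by cleanupLogUntil or overwritten by a retry.
    }

Deletion reverses the order: removing the log file first makes the segment invisible immediately, so a partial delete again leaves only unreferenced auxiliary files.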
>>>> > A.3 Caching – storing locally the segments retrieved from the remote storage is excluded, as it does not align with the original intent and even defeats some of its purposes (save disk space etc.). That said, could there be other types of use cases where the pattern of access to the remotely stored segments would benefit from local caching (and potentially read-ahead)? Consider the use case of a large pool of consumers which start a backfill at the same time for one day's worth of data from one year ago stored remotely. Caching the segments locally would allow uncoupling the load on the remote storage from the load on the Kafka cluster. Maybe the RLM could expose a configuration parameter to switch that feature on/off?
>>>>
>>>> I tend to agree here; caching remote segments locally and making this configurable sounds pretty practical to me. We should implement this, though maybe not in the first iteration.
>>>>
>>>> Br,
>>>> Ivan
>>>>
>>>> [1] https://github.com/harshach/kafka/pull/18/files#diff-4d73d01c16caed6f2548fc3063550ef0R152
>>>>
>>>> On Thu, 19 Dec 2019 at 19:49, Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:
>>>>
>>>> Hi Jun,
>>>>
>>>> Thank you for the feedback. I am trying to understand how a push-based approach would work.
>>>> In order for the metadata to be propagated (under the assumption you stated), would you plan to add a new API in Kafka to allow the metadata store to send it directly to the brokers?
>>>>
>>>> Thanks,
>>>> Alexandre
>>>>
>>>> On Wed, Dec 18, 2019 at 20:14, Jun Rao <j...@confluent.io> wrote:
>>>>
>>>> Hi, Satish and Ivan,
>>>>
>>>> There are different ways of doing push-based metadata propagation. Some object stores may support that already. For example, S3 supports events notification (https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html). Otherwise, one could use a separate metadata store that supports push-based change propagation. Other people have mentioned using a Kafka topic. The best approach may depend on the object store and the operational environment (e.g. whether an external metadata store is already available).
>>>>
>>>> The above discussion is based on the assumption that we need to cache the object metadata locally in every broker. I mentioned earlier that an alternative is to just store/retrieve that metadata in an external metadata store. That may simplify the implementation in some cases.
>>>>
>>>> Thanks,
>>>>
>>>> Jun
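As one concrete reading of the "Kafka topic" option, a minimal sketch using the plain Java producer client; the internal topic name and the serialized metadata value are assumptions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    // Publish one remote-segment metadata record to an internal topic so
    // brokers can consume changes instead of polling the object store.
    // Keying by topic-partition keeps each partition's metadata ordered.
    static void publish(String topicPartition, byte[] serializedMetadata) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        try (Producer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("__remote_log_metadata",
                    topicPartition, serializedMetadata));
        }
    }

Each broker would then run a consumer on this topic and materialize its own local view, which gives the strongly consistent, push-based metadata path the thread is converging on. (A real implementation would keep one long-lived producer rather than creating one per record.)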
>>>> On Thu, Dec 5, 2019 at 7:01 AM Satish Duggana <satish.dugg...@gmail.com> wrote:
>>>>
>>>> Hi Jun,
>>>> Thanks for your reply.
>>>>
>>>> Currently, `listRemoteSegments` is called at the configured interval (not every second; it defaults to 30 secs). Storing remote log metadata in a strongly consistent store for the S3 RSM is raised in a PR comment [1].
>>>> RLM invokes RSM at regular intervals, and RSM can return remote segment metadata if it is available. RSM is responsible for maintaining and fetching those entries. It should be based on whatever mechanism is consistent and efficient for the respective remote storage.
>>>>
>>>> Can you give more details about the push-based mechanism from RSM?
>>>>
>>>> 1. https://github.com/apache/kafka/pull/7561#discussion_r344576223
>>>>
>>>> Thanks,
>>>> Satish.
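For illustration, a sketch of the pull loop Satish describes; only listRemoteSegments comes from the thread, while the scheduling and helper methods are assumptions:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.kafka.common.TopicPartition;

    // RLM polls the RSM on a fixed interval (30s default, per the thread)
    // and refreshes its local view of remote segment metadata.
    ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleWithFixedDelay(() -> {
        for (TopicPartition tp : tieredPartitions()) {
            // The RSM decides how to serve this consistently and
            // efficiently for its particular remote storage.
            updateLocalMetadata(tp, remoteStorageManager.listRemoteSegments(tp));
        }
    }, 0, 30, TimeUnit.SECONDS);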
>>>> On Thu, Dec 5, 2019 at 4:23 AM Jun Rao <j...@confluent.io> wrote:
>>>>
>>>> Hi, Harsha,
>>>>
>>>> Thanks for the reply.
>>>>
>>>> 40/41. I am curious which block storages you have tested. S3 seems to be one of the popular block stores. The concerns that I have with the pull-based approach are the following.
>>>> (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you have 100,000 partitions and want to pull the metadata for each partition at the rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.
>>>> (b) Semantics: S3 list objects are eventually consistent. So, when you do a list object request, there is no guarantee that you can see all uploaded objects. This could impact the correctness of subsequent logic.
>>>> (c) Efficiency: Blindly pulling metadata when there is no change adds unnecessary overhead in the broker as well as in the block store.
>>>>
>>>> So, have you tested S3? If so, could you share your experience in terms of cost, semantics and efficiency?
>>>>
>>>> Jun
>>>>
>>>> On Tue, Dec 3, 2019 at 10:11 PM Harsha Chintalapani <ka...@harsha.io> wrote:
>>>>
>>>> Hi Jun,
>>>> Thanks for the reply.
>>>>
>>>> On Tue, Nov 26, 2019 at 3:46 PM, Jun Rao <j...@confluent.io> wrote:
>>>>
>>>> > Hi, Satish and Ying,
>>>> >
>>>> > Thanks for the reply.
>>>> >
>>>> > 40/41. There are two different ways that we can approach this. One is what you said: we can have an opinionated way of storing and populating the tier metadata that we think is good enough for everyone. I am not sure that this is the case, based on what's currently proposed in the KIP. For example, I am not sure that (1) everyone always needs local metadata, (2) the current local storage format is general enough, and (3) everyone wants to use the pull-based approach to propagate the metadata. Another approach is to make this pluggable and let the implementor implement the best approach for a particular block storage. I haven't seen any comments from Slack/AirBnb in the mailing list on this topic. It would be great if they can provide feedback directly here.
>>>>
>>>> The current interfaces are designed with the most popular block storages available today, and we did 2 implementations with these interfaces; both are yielding good results as we go through the testing. If there is ever a need for a push-based approach, we can definitely evolve the interface; in the past we have marked interfaces as evolving to make room for unknowns in the future.
>>>> If you have any suggestions around the current interfaces, please propose them; we are happy to see if we can work them into it.
>>>>
>>>> > 43. To offer tier storage as a general feature, ideally all existing capabilities should still be supported. It's fine if the Uber implementation doesn't support all capabilities for internal usage. However, the framework should be general enough.
>>>>
>>>> We agree on that as a principle. But all of these major features are mostly landing right now, and asking a big new feature such as tiered storage to support all the new features at once would be a big ask. We can document how we approach solving these in future iterations.
>>>> Our goal is to make this tiered storage feature work for everyone.
>>>>
>>>> > 43.3 This is more than just serving the tiered data from block storage. With KIP-392, the consumer can now resolve conflicts with the replica based on leader epoch. So, we need to make sure that the leader epoch can be recovered properly from tier storage.
>>>>
>>>> We are working on testing our approach and we will update the KIP with design details.
>>>>
>>>> > 43.4 For JBOD, if tier storage stores the tier metadata locally, we need to support moving such metadata across disk directories, since JBOD supports moving data across disks.
>>>>
>>>> The KIP is updated with JBOD details. Having said that, JBOD tooling needs to evolve to support production loads. Most users will be interested in using tiered storage without JBOD support on day 1.
>>>>
>>>> Thanks,
>>>> Harsha
>>>>
>>>> > As for a meeting, we could have a KIP e-meeting on this if needed, but it will be open to everyone and will be recorded and shared. Often, the details are still resolved through the mailing list.
>>>> >
>>>> > Jun
>>>>
>>>> On Tue, Nov 19, 2019 at 6:48 PM Ying Zheng <yi...@uber.com.invalid> wrote:
>>>>
>>>> Please ignore my previous email. I didn't know Apache requires all the discussions to be "open".
>>>>
>>>> On Tue, Nov 19, 2019, 5:40 PM Ying Zheng <yi...@uber.com> wrote:
>>>>
>>>> Hi Jun,
>>>>
>>>> Thank you very much for your feedback!
>>>>
>>>> Can we schedule a meeting in your Palo Alto office in December? I think a face-to-face discussion is much more efficient than emails. Both Harsha and I can visit you. Satish may be able to join us remotely.
>>>>
>>>> On Fri, Nov 15, 2019 at 11:04 AM Jun Rao <j...@confluent.io> wrote:
>>>>
>>>> Hi, Satish and Harsha,
>>>>
>>>> The following is more detailed high-level feedback for the KIP. Overall, the KIP seems useful. The challenge is how to design it such that it's general enough to support different ways of implementing this feature and to support existing features.
>>>> 40. Local segment metadata storage: The KIP makes the assumption that the metadata for the archived log segments is cached locally in every broker, and it provides a specific implementation for the local storage in the framework. We should probably discuss this more. For example, some tier storage providers may not want to cache the metadata locally, and may instead just rely upon a remote key/value store if such a store is already present. If a local store is used, there could be different ways of implementing it (e.g., based on customized local files, an embedded local store like RocksDB, etc.). An alternative design is to just provide an interface for retrieving the tier segment metadata and leave the details of how to get the metadata outside of the framework, as sketched below.
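A sketch of that alternative; the interface and type names here are assumptions, not a proposed API:

    import java.util.List;
    import org.apache.kafka.common.TopicPartition;

    // The framework only defines how tier segment metadata is *retrieved*;
    // whether the plugin reads a local cache, RocksDB, or a remote
    // key/value store is left entirely to the implementation.
    public interface RemoteLogSegmentMetadataRetriever {
        // Remote segments of this partition whose offset ranges overlap
        // [startOffset, endOffset).
        List<RemoteLogSegmentMetadata> listRemoteSegments(
                TopicPartition topicPartition, long startOffset, long endOffset);
    }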
In >>>> >>>> particular, how >>>> >>>> is >>>> >>>> the >>>> >>>> producer state integrated with the remote storage. 43.3 >>>> >>>> Now that >>>> >>>> KIP-392 >>>> >>>> (allow consumers to fetch from closest replica) is >>>> >>>> implemented, >>>> >>>> we >>>> >>>> need >>>> >>>> to >>>> >>>> discuss how reading from a follower replica is supported >>>> >>>> with >>>> >>>> tier >>>> >>>> storage. >>>> >>>> 43.4 We need to discuss how JBOD is supported with tier >>>> >>>> storage. >>>> >>>> Thanks, >>>> >>>> Jun >>>> >>>> On Fri, Nov 8, 2019 at 12:06 AM Tom Bentley < >>>> >>>> tbent...@redhat.com >>>> >>>> wrote: >>>> >>>> Thanks for those insights Ying. >>>> >>>> On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng >>>> >>>> <yi...@uber.com.invalid >>>> >>>> wrote: >>>> >>>> Thanks, I missed that point. However, there's still a >>>> >>>> point at >>>> >>>> which >>>> >>>> the >>>> >>>> consumer fetches start getting served from remote storage >>>> >>>> (even >>>> >>>> if >>>> >>>> that >>>> >>>> point isn't as soon as the local log retention time/size). >>>> >>>> This >>>> >>>> represents >>>> >>>> a kind of performance cliff edge and what I'm really >>>> >>>> interested >>>> >>>> in >>>> >>>> is >>>> >>>> how >>>> >>>> easy it is for a consumer which falls off that cliff to >>>> >>>> catch up >>>> >>>> and so >>>> >>>> its >>>> >>>> fetches again come from local storage. Obviously this can >>>> >>>> depend >>>> >>>> on >>>> >>>> all >>>> >>>> sorts of factors (like production rate, consumption rate), >>>> >>>> so >>>> >>>> it's >>>> >>>> not >>>> >>>> guaranteed (just like it's not guaranteed for Kafka >>>> >>>> today), but >>>> >>>> this >>>> >>>> would >>>> >>>> represent a new failure mode. >>>> >>>> As I have explained in the last mail, it's a very rare >>>> >>>> case that >>>> >>>> a >>>> >>>> consumer >>>> need to read remote data. With our experience at Uber, >>>> >>>> this only >>>> >>>> happens >>>> >>>> when the consumer service had an outage for several hours. >>>> >>>> There is not a "performance cliff" as you assume. The >>>> >>>> remote >>>> >>>> storage >>>> >>>> is >>>> >>>> even faster than local disks in terms of bandwidth. >>>> >>>> Reading from >>>> >>>> remote >>>> >>>> storage is going to have higher latency than local disk. >>>> >>>> But >>>> >>>> since >>>> >>>> the >>>> >>>> consumer >>>> is catching up several hours data, it's not sensitive to >>>> >>>> the >>>> >>>> sub-second >>>> >>>> level >>>> latency, and each remote read request will read a large >>>> >>>> amount of >>>> >>>> data to >>>> >>>> make the overall performance better than reading from local >>>> >>>> disks. >>>> >>>> Another aspect I'd like to understand better is the effect >>>> >>>> of >>>> >>>> serving >>>> >>>> fetch >>>> >>>> request from remote storage has on the broker's network >>>> >>>> utilization. If >>>> >>>> we're just trimming the amount of data held locally >>>> >>>> (without >>>> >>>> increasing >>>> >>>> the >>>> >>>> overall local+remote retention), then we're effectively >>>> >>>> trading >>>> >>>> disk >>>> >>>> bandwidth for network bandwidth when serving fetch >>>> >>>> requests from >>>> >>>> remote >>>> >>>> storage (which I understand to be a good thing, since >>>> >>>> brokers are >>>> >>>> often/usually disk bound). But if we're increasing the >>>> >>>> overall >>>> >>>> local+remote >>>> >>>> retention then it's more likely that network itself >>>> >>>> becomes the >>>> >>>> bottleneck. 
>>>> > I appreciate this is all rather hand-wavy; I'm just trying to understand how this would affect broker performance, so I'd be grateful for any insights you can offer.
>>>>
>>>> Network bandwidth is a function of produce speed; it has nothing to do with remote retention. As long as the data is shipped to remote storage, you can keep the data there for 1 day or 1 year or 100 years, and it doesn't consume any network resources.
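To put a number on that point (illustrative, not from the thread): at an aggregate produce rate of 100 MB/s, a cluster ships roughly 100 MB/s to remote storage once, whether remote retention is one day or one hundred years; retention length changes storage cost, not network load. Only consumers actually reading from remote storage add further network traffic.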