Hi, Satish,

Thanks for the update. Just to clarify: which doc has the latest updates, the wiki or the Google doc?
Jun

On Thu, May 14, 2020 at 10:38 AM Satish Duggana <satish.dugg...@gmail.com> wrote:

> Hi Jun,
> Thanks for your comments. We updated the KIP with more details.

> >100. For each of the operations related to tiering, it would be useful to provide a description on how it works with the new API. These include things like consumer fetch, replica fetch, offsetForTimestamp, retention (remote and local) by size, time and logStartOffset, topic deletion, etc. This will tell us if the proposed APIs are sufficient.

> We addressed most of these APIs in the KIP. We can add more details if needed.

> >101. For the default implementation based on internal topic, is it meant as a proof of concept or for production usage? I assume that it's the former. However, if it's the latter, then the KIP needs to describe the design in more detail.

> It is meant for production usage, as mentioned in an earlier mail. We plan to update this section in the next few days.

> >102. When tiering a segment, the segment is first written to the object store and then its metadata is written to RLMM using the API "void putRemoteLogSegmentData()". One potential issue with this approach is that if the system fails after the first operation, it leaves garbage in the object store that's never reclaimed. One way to improve this is to have two separate APIs, something like preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().

> That is a good point. We currently handle this in a different way, using markers in the segment, but your suggestion is much better (a rough sketch of it follows below).

> >103. It seems that the transactional support and the ability to read from follower are missing.

> The KIP is updated with transactional support, follower fetch semantics, and reading from a follower.

> >104. It would be useful to provide a testing plan for this KIP.

> We added a few tests by introducing a test util for tiered storage in the PR. We will provide the testing plan in the next few days.

> Thanks,
> Satish.
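To make the two-phase idea in 102 concrete, here is a minimal sketch of what a prepare/commit split could look like. The types and signatures are illustrative only (they are not the KIP's actual API); the point is that a segment's metadata only becomes visible to readers once the copy to remote storage has succeeded, so a crash in between leaves an uncommitted entry that a background task can garbage-collect, instead of unreclaimed objects in the store.

    // Illustrative only: placeholder types, not the KIP's actual interfaces.
    final class RemoteLogSegmentId { /* topic-partition + UUID, elided */ }
    final class RemoteLogSegmentMetadata { /* offsets, timestamps, size, elided */ }

    public interface RemoteLogMetadataManager {

        // Phase 1: record an in-progress upload before any bytes are copied.
        // If the broker fails after this call, the entry stays in a
        // "copy started" state and can be cleaned up later.
        void preparePutRemoteLogSegmentData(RemoteLogSegmentId id,
                                            RemoteLogSegmentMetadata metadata);

        // Phase 2: mark the upload as complete once the remote storage has
        // acknowledged the copy. Only committed segments are returned to
        // readers (consumer fetch, listings, retention checks, etc.).
        void commitPutRemoteLogSegmentData(RemoteLogSegmentId id);
    }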
> On Wed, Feb 26, 2020 at 9:43 PM Harsha Chintalapani <ka...@harsha.io> wrote:

>> On Tue, Feb 25, 2020 at 12:46 PM, Jun Rao <j...@confluent.io> wrote:

>>> Hi, Satish,
>>> Thanks for the updated doc. The new API seems to be an improvement overall. A few more comments below.

>>> 100. For each of the operations related to tiering, it would be useful to provide a description on how it works with the new API. These include things like consumer fetch, replica fetch, offsetForTimestamp, retention (remote and local) by size, time and logStartOffset, topic deletion, etc. This will tell us if the proposed APIs are sufficient.

>> Thanks for the feedback, Jun. We will add more details around this.

>>> 101. For the default implementation based on internal topic, is it meant as a proof of concept or for production usage? I assume that it's the former. However, if it's the latter, then the KIP needs to describe the design in more detail.

>> Yes, it is meant for production use. Ideally it would be good to merge this in as the default implementation for the metadata service. We can add more details around design and testing.

>>> 102. When tiering a segment, the segment is first written to the object store and then its metadata is written to RLMM using the API "void putRemoteLogSegmentData()". One potential issue with this approach is that if the system fails after the first operation, it leaves garbage in the object store that's never reclaimed. One way to improve this is to have two separate APIs, something like preparePutRemoteLogSegmentData() and commitPutRemoteLogSegmentData().

>>> 103. It seems that the transactional support and the ability to read from follower are missing.

>>> 104. It would be useful to provide a testing plan for this KIP.

>> We are working on adding more details around transactional support and coming up with a test plan, including system tests and integration tests.

>>> Thanks,
>>> Jun

>>> On Mon, Feb 24, 2020 at 8:10 AM Satish Duggana <satish.dugg...@gmail.com> wrote:

>>> Hi Jun,
>>> Please look at the earlier reply and let us know your comments.

>>> Thanks,
>>> Satish.

>>> On Wed, Feb 12, 2020 at 4:06 PM Satish Duggana <satish.dugg...@gmail.com> wrote:

>>> Hi Jun,
>>> Thanks for your comments on the separation of remote log metadata storage and remote log storage.
>>> We have had a few discussions since early Jan on how to support eventually consistent stores like S3 by decoupling remote log segment metadata from remote log storage. It is written up in detail in the doc here [1]. Below is a brief summary of the discussion from that doc.

>>> The current approach consists of pulling the remote log segment metadata from the remote log storage APIs. It worked fine for storages like HDFS. But one of the problems of relying on the remote storage to maintain metadata is that tiered storage then needs the remote store to be strongly consistent, with an impact not only on the metadata (e.g. LIST in S3) but also on the segment data (e.g. GET after a DELETE in S3). The cost of maintaining metadata in remote storage also needs to be factored in. This is true in the case of S3, where LIST APIs incur huge costs, as you raised earlier.

>>> So, it is good to separate the remote storage from the remote log metadata store. We refactored the existing RemoteStorageManager and introduced RemoteLogMetadataManager. The remote log metadata store should give strong consistency semantics, but remote log storage can be eventually consistent.

>>> We can have a default implementation for RemoteLogMetadataManager which uses an internal topic (as mentioned in one of our earlier emails) as storage. But users can always plug in their own RemoteLogMetadataManager implementation based on their environment.

>>> Please go through the updated KIP and let us know your comments. We have started refactoring for the changes mentioned in the KIP and there may be a few more updates to the APIs.

>>> [1] https://docs.google.com/document/d/1qfkBCWL1e7ZWkHU7brxKDBebq4ie9yK20XJnKbgAlew/edit?ts=5e208ec7#

>>> On Fri, Dec 27, 2019 at 5:43 PM Ivan Yurchenko <ivan0yurche...@gmail.com> wrote:

>>> Hi all,

>>> Jun:
>>> (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you have 100,000 partitions and want to pull the metadata for each partition at the rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.

>>> I want to note here that no reasonably durable storage will be cheap at 100k RPS. For example, DynamoDB might give the same ballpark figures.
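For reference, the arithmetic behind these figures, using the LIST price quoted above:

    100,000 partitions x 1 LIST/sec                      = 100,000 requests/sec
    100,000 req/sec x $0.005 per 1,000 requests          = $0.50/sec
    $0.50/sec x 86,400 sec/day                           ~ $43,000/day (the "roughly $40K per day")

    With the 30-second default interval mentioned later in the thread:
    100,000 partitions / 30 sec                          ~ 3,333 requests/sec
    3,333 req/sec x $0.005 per 1,000 x 86,400 sec/day    ~ $1,400/day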
>>> If we want to keep the pull-based approach, we can try to reduce this number in several ways: doing listings less frequently (as Satish mentioned, with the current defaults it's ~3.33k RPS for your example), or batching listing operations in some way (depending on the storage; it might require changing RSM's interface).

>>> There are different ways for doing push based metadata propagation. Some object stores may support that already. For example, S3 supports event notifications.

>>> This sounds interesting. However, I see a couple of issues using it:
>>> 1. As I understand the documentation, notification delivery is not guaranteed and it's recommended to periodically do LIST to fill the gaps. Which brings us back to the same LIST consistency guarantees issue.
>>> 2. The same goes for broker start: to get the current state, we need to LIST.
>>> 3. The dynamic set of multiple consumers (RSMs): AFAIK SQS and SNS aren't designed for such a case.

>>> Alexandre:

>>> A.1 As commented on PR 7561, the S3 consistency model [1][2] implies RSM cannot rely solely on S3 APIs to guarantee the expected strong consistency. The proposed implementation [3] would need to be updated to take this into account. Let's talk more about this.

>>> Thank you for the feedback. I clearly see the need for changing the S3 implementation to provide stronger consistency guarantees. As I see from this thread, there are several possible approaches to this. Let's discuss RemoteLogManager's contract and behavior (like pull vs push model) further before picking one (or several?) of them. I'm going to do some evaluation of DynamoDB for the pull-based approach, if it's possible to apply it at a reasonable cost. Also, of the push-based approach with a Kafka topic as the medium.

>>> A.2.3 Atomicity – what does an implementation of RSM need to provide with respect to atomicity of the APIs copyLogSegment, cleanupLogUntil and deleteTopicPartition? If a partial failure happens in any of those (e.g. in the S3 implementation, if one of the multiple uploads fails [4]),

>>> The S3 implementation is going to change, but it's worth clarifying anyway. The segment log file is uploaded after S3 has acked the upload of all other files associated with the segment, and only after this does the whole segment file set become visible remotely for operations like listRemoteSegments [1]. In case of upload failure, the files that have been successfully uploaded stay as invisible garbage that is collected by cleanupLogUntil (or overwritten successfully later). And the opposite happens during deletion: log files are deleted first. This approach should generally work when we solve the consistency issues by adding a strongly consistent storage: a segment's uploaded files remain invisible garbage until some metadata about them is written.
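A rough sketch of the ordering Ivan describes, with placeholder names only (this is not the actual RSM implementation): auxiliary files go first and the log file last, and deletion runs in the reverse order, so a partial failure only ever leaves invisible garbage rather than a half-visible segment.

    // Illustrative ordering only; "RemoteStorage" is a placeholder, not the KIP's
    // RemoteStorageManager API, and the file suffixes are made up.
    interface RemoteStorage {
        void put(String key);
        void delete(String key);
    }

    final class SegmentCopyOrdering {
        // Upload: auxiliary files first, the log file last. The segment only
        // becomes visible to listings once the log file itself exists, so a
        // crash mid-upload leaves invisible garbage, not a half-usable segment.
        static void copySegment(RemoteStorage storage, String segmentPrefix) {
            storage.put(segmentPrefix + ".offset-index");
            storage.put(segmentPrefix + ".time-index");
            storage.put(segmentPrefix + ".remote-log-index");
            storage.put(segmentPrefix + ".log");           // visibility point
        }

        // Delete: reverse order. Remove the log file first so the segment
        // disappears from listings, then clean up the now-invisible leftovers.
        static void deleteSegment(RemoteStorage storage, String segmentPrefix) {
            storage.delete(segmentPrefix + ".log");
            storage.delete(segmentPrefix + ".remote-log-index");
            storage.delete(segmentPrefix + ".time-index");
            storage.delete(segmentPrefix + ".offset-index");
        }
    }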
>>> A.3 Caching – storing locally the segments retrieved from the remote storage is excluded as it does not align with the original intent and even defeats some of its purposes (saving disk space, etc.). That said, could there be other types of use cases where the pattern of access to the remotely stored segments would benefit from local caching (and potentially read-ahead)? Consider the use case of a large pool of consumers which start a backfill at the same time for one day's worth of data from one year ago stored remotely. Caching the segments locally would allow uncoupling the load on the remote storage from the load on the Kafka cluster. Maybe the RLM could expose a configuration parameter to switch that feature on/off?

>>> I tend to agree here; caching remote segments locally and making this configurable sounds pretty practical to me. We should implement this, maybe not in the first iteration.

>>> Br,
>>> Ivan

>>> [1] https://github.com/harshach/kafka/pull/18/files#diff-4d73d01c16caed6f2548fc3063550ef0R152

>>> On Thu, 19 Dec 2019 at 19:49, Alexandre Dupriez <alexandre.dupr...@gmail.com> wrote:

>>> Hi Jun,

>>> Thank you for the feedback. I am trying to understand how a push-based approach would work. In order for the metadata to be propagated (under the assumption you stated), would you plan to add a new API in Kafka to allow the metadata store to send it directly to the brokers?

>>> Thanks,
>>> Alexandre

>>> On Wed, Dec 18, 2019 at 20:14, Jun Rao <j...@confluent.io> wrote:

>>> Hi, Satish and Ivan,

>>> There are different ways for doing push based metadata propagation. Some object stores may support that already. For example, S3 supports event notifications (https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html). Otherwise one could use a separate metadata store that supports push-based change propagation. Other people have mentioned using a Kafka topic. The best approach may depend on the object store and the operational environment (e.g. whether an external metadata store is already available).

>>> The above discussion is based on the assumption that we need to cache the object metadata locally in every broker. I mentioned earlier that an alternative is to just store/retrieve that metadata in an external metadata store. That may simplify the implementation in some cases.

>>> Thanks,
>>> Jun

>>> On Thu, Dec 5, 2019 at 7:01 AM Satish Duggana <satish.dugg...@gmail.com> wrote:

>>> Hi Jun,
>>> Thanks for your reply.

>>> Currently, `listRemoteSegments` is called at the configured interval (not every second; it defaults to 30 secs). Storing remote log metadata in a strongly consistent store for the S3 RSM is raised in a PR comment [1]. RLM invokes RSM at regular intervals and RSM can give remote segment metadata if it is available. RSM is responsible for maintaining and fetching those entries. It should be based on whatever mechanism is consistent and efficient with the respective remote storage.

>>> Can you give more details about the push-based mechanism from RSM?

>>> 1. https://github.com/apache/kafka/pull/7561#discussion_r344576223

>>> Thanks,
>>> Satish.
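For illustration, one shape a push-based mechanism could take is the Kafka-topic medium mentioned earlier in the thread: whoever uploads a segment publishes a metadata event to an internal topic, and every broker materializes those events into a local view instead of polling listRemoteSegments(). This is only a sketch with assumed names (including the topic name), not something the KIP specifies:

    // Hypothetical sketch of push-based propagation over an internal topic.
    import java.time.Duration;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    class RemoteSegmentMetadataCache {
        // key: "<topic>-<partition>-<baseOffset>", value: serialized segment metadata
        private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

        void run(KafkaConsumer<String, byte[]> consumer) {
            consumer.subscribe(List.of("__remote_log_segment_metadata")); // assumed topic name
            while (true) {
                for (ConsumerRecord<String, byte[]> rec : consumer.poll(Duration.ofMillis(500))) {
                    if (rec.value() == null) {
                        cache.remove(rec.key());           // tombstone: segment deleted remotely
                    } else {
                        cache.put(rec.key(), rec.value()); // new or updated segment metadata
                    }
                }
            }
        }
    }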
>>> On Thu, Dec 5, 2019 at 4:23 AM Jun Rao <j...@confluent.io> wrote:

>>> Hi, Harsha,

>>> Thanks for the reply.

>>> 40/41. I am curious which block storages you have tested. S3 seems to be one of the popular block stores. The concerns that I have with the pull-based approach are the following.
>>> (a) Cost: S3 list object requests cost $0.005 per 1000 requests. If you have 100,000 partitions and want to pull the metadata for each partition at the rate of 1/sec, it can cost $0.5/sec, which is roughly $40K per day.
>>> (b) Semantics: S3 list objects are eventually consistent. So, when you do a list object request, there is no guarantee that you can see all uploaded objects. This could impact the correctness of subsequent logic.
>>> (c) Efficiency: Blindly pulling metadata when there is no change adds unnecessary overhead in the broker as well as in the block store.

>>> So, have you guys tested S3? If so, could you share your experience in terms of cost, semantics and efficiency?

>>> Jun

>>> On Tue, Dec 3, 2019 at 10:11 PM Harsha Chintalapani <ka...@harsha.io> wrote:

>>> Hi Jun,
>>> Thanks for the reply.

>>> On Tue, Nov 26, 2019 at 3:46 PM, Jun Rao <j...@confluent.io> wrote:

>>> Hi, Satish and Ying,

>>> Thanks for the reply.

>>> 40/41. There are two different ways that we can approach this. One is what you said. We can have an opinionated way of storing and populating the tier metadata that we think is good enough for everyone. I am not sure if this is the case based on what's currently proposed in the KIP. For example, I am not sure that (1) everyone always needs local metadata; (2) the current local storage format is general enough; and (3) everyone wants to use the pull based approach to propagate the metadata. Another approach is to make this pluggable and let the implementor implement the best approach for a particular block storage. I haven't seen any comments from Slack/Airbnb in the mailing list on this topic. It would be great if they can provide feedback directly here.

>>> The current interfaces are designed with the most popular block storages available today in mind, and we did two implementations with these interfaces; both are yielding good results as we go through testing. If there is ever a need for a pull-based approach we can definitely evolve the interface. In the past we have marked interfaces as evolving to make room for unknowns in the future. If you have any suggestions around the current interfaces, please propose them and we are happy to see if we can work them in.
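For readers unfamiliar with the convention Harsha refers to: Kafka's public interfaces carry stability annotations, so a new pluggable interface can be marked as evolving while its shape settles. A minimal illustration (the interface name below is made up):

    import org.apache.kafka.common.annotation.InterfaceStability;

    // Evolving signals that the interface may still change in
    // backwards-incompatible ways in a minor release.
    @InterfaceStability.Evolving
    public interface RemoteTierMetadataProvider {   // hypothetical name, not the KIP's API
        void start();
        void stop();
    }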
>>> 43. To offer tier storage as a general feature, ideally all existing capabilities should still be supported. It's fine if the Uber implementation doesn't support all capabilities for internal usage. However, the framework should be general enough.

>>> We agree on that as a principle. But all of these major features are mostly landing right now, and asking a new big feature such as tiered storage to support all of the new features from day one would be a big ask. We can document how we approach solving these in future iterations. Our goal is to make this tiered storage feature work for everyone.

>>> 43.3 This is more than just serving the tiered data from block storage. With KIP-392, the consumer now can resolve the conflicts with the replica based on leader epoch. So, we need to make sure that leader epoch can be recovered properly from tier storage.

>>> We are working on testing our approach and we will update the KIP with design details.

>>> 43.4 For JBOD, if tier storage stores the tier metadata locally, we need to support moving such metadata across disk directories since JBOD supports moving data across disks.

>>> The KIP is updated with JBOD details. Having said that, JBOD tooling needs to evolve to support production loads. Most users will be interested in using tiered storage without JBOD support on day 1.

>>> Thanks,
>>> Harsha

>>> As for a meeting, we could have a KIP e-meeting on this if needed, but it will be open to everyone and will be recorded and shared. Often, the details are still resolved through the mailing list.

>>> Jun

>>> On Tue, Nov 19, 2019 at 6:48 PM Ying Zheng <yi...@uber.com.invalid> wrote:

>>> Please ignore my previous email. I didn't know Apache requires all the discussions to be "open".

>>> On Tue, Nov 19, 2019, 5:40 PM Ying Zheng <yi...@uber.com> wrote:

>>> Hi Jun,

>>> Thank you very much for your feedback!

>>> Can we schedule a meeting in your Palo Alto office in December? I think a face-to-face discussion is much more efficient than emails. Both Harsha and I can visit you. Satish may be able to join us remotely.

>>> On Fri, Nov 15, 2019 at 11:04 AM Jun Rao <j...@confluent.io> wrote:

>>> Hi, Satish and Harsha,

>>> The following is more detailed, high-level feedback for the KIP. Overall, the KIP seems useful. The challenge is how to design it such that it's general enough to support different ways of implementing this feature and to support existing features.

>>> 40. Local segment metadata storage: The KIP makes the assumption that the metadata for the archived log segments is cached locally in every broker, and it provides a specific implementation for the local storage in the framework. We probably should discuss this more. For example, some tier storage providers may not want to cache the metadata locally and would just rely upon a remote key/value store if such a store is already present. If a local store is used, there could be different ways of implementing it (e.g., based on customized local files, an embedded local store like RocksDB, etc.). An alternative design is to just provide an interface for retrieving the tier segment metadata and leave the details of how to get the metadata outside of the framework.
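To illustrate the alternative described in 40, such a retrieval interface could be as small as the sketch below. The names are made up; the point is only that the framework asks for metadata and stays agnostic about where it lives (local files, an embedded RocksDB, or a remote key/value store):

    import java.util.Optional;

    // Hypothetical retrieval-only interface for tier segment metadata (not the KIP's API).
    // How the metadata is stored and kept up to date is left to the implementation.
    public interface TierSegmentMetadataRetriever {

        // Metadata of the remote segment containing the given offset, if any.
        Optional<TierSegmentMetadata> segmentForOffset(String topic, int partition, long offset);

        // Earliest offset still available in remote storage for this partition.
        Optional<Long> earliestRemoteOffset(String topic, int partition);
    }

    // Placeholder for whatever fields the segment metadata would carry.
    final class TierSegmentMetadata { }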
>>> 41. RemoteStorageManager interface and the usage of the interface in the framework: I am not sure if the interface is general enough. For example, it seems that RemoteLogIndexEntry is tied to a specific way of storing the metadata in remote storage. The framework uses the listRemoteSegments() API in a pull-based approach. However, in some other implementations, a push-based approach may be preferred. I don't have a concrete proposal yet. But it would be useful to give this area some more thought and see if we can make the interface more general.

>>> 42. In the diagram, the RemoteLogManager is side by side with LogManager. This KIP only discussed how the fetch request is handled between the two layers. However, we should also consider how other requests that touch the log can be handled, e.g., list offsets by timestamp, delete records, etc. Also, in this model, it's not clear which component is responsible for managing the log start offset. It seems that the log start offset could be changed by both RemoteLogManager and LogManager.

>>> 43. There are quite a few existing features not covered by the KIP. It would be useful to discuss each of those.
>>> 43.1 I won't say that compacted topics are rarely used and always small. For example, KStreams uses compacted topics for storing the states, and sometimes the size of the topic could be large. While it might be ok to not support compacted topics initially, it would be useful to have a high-level idea on how this might be supported down the road so that we don't have to make incompatible API changes in the future.
>>> 43.2 We need to discuss how EOS is supported. In particular, how is the producer state integrated with the remote storage?
>>> 43.3 Now that KIP-392 (allow consumers to fetch from the closest replica) is implemented, we need to discuss how reading from a follower replica is supported with tier storage.
>>> 43.4 We need to discuss how JBOD is supported with tier storage.

>>> Thanks,

>>> Jun

>>> On Fri, Nov 8, 2019 at 12:06 AM Tom Bentley <tbent...@redhat.com> wrote:

>>> Thanks for those insights Ying.

>>> On Thu, Nov 7, 2019 at 9:26 PM Ying Zheng <yi...@uber.com.invalid> wrote:

>>> Thanks, I missed that point. However, there's still a point at which the consumer fetches start getting served from remote storage (even if that point isn't as soon as the local log retention time/size).
>>> This represents a kind of performance cliff edge, and what I'm really interested in is how easy it is for a consumer which falls off that cliff to catch up so that its fetches again come from local storage. Obviously this can depend on all sorts of factors (like production rate, consumption rate), so it's not guaranteed (just like it's not guaranteed for Kafka today), but this would represent a new failure mode.

>>> As I explained in the last mail, it's a very rare case that a consumer needs to read remote data. In our experience at Uber, this only happens when the consumer service has had an outage for several hours.

>>> There is not a "performance cliff" as you assume. The remote storage is even faster than local disks in terms of bandwidth. Reading from remote storage is going to have higher latency than local disk. But since the consumer is catching up on several hours of data, it's not sensitive to sub-second-level latency, and each remote read request will read a large amount of data to make the overall performance better than reading from local disks.

>>> Another aspect I'd like to understand better is the effect that serving fetch requests from remote storage has on the broker's network utilization. If we're just trimming the amount of data held locally (without increasing the overall local+remote retention), then we're effectively trading disk bandwidth for network bandwidth when serving fetch requests from remote storage (which I understand to be a good thing, since brokers are often/usually disk bound). But if we're increasing the overall local+remote retention, then it's more likely that the network itself becomes the bottleneck.

>>> I appreciate this is all rather hand-wavy; I'm just trying to understand how this would affect broker performance, so I'd be grateful for any insights you can offer.

>>> Network bandwidth is a function of produce speed; it has nothing to do with remote retention. As long as the data is shipped to remote storage, you can keep the data there for 1 day or 1 year or 100 years; it doesn't consume any network resources.
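A back-of-the-envelope illustration of that last point, with made-up numbers purely for the arithmetic: if producers write 100 MB/s into the cluster, shipping segments to remote storage costs roughly 100 MB/s of outbound network whether the remote retention is 1 day or 1 year, because the upload rate tracks the produce rate, not the retention. Additional network is consumed only when a consumer actually reads remote data (e.g. a catch-up after a multi-hour outage), and only for the duration of that catch-up.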