Hi Jun,

Thank you so much for your thoughtful feedback. I sincerely apologize for the delayed response; I somehow missed your email.
I truly appreciate the insights you've shared; they pinpointed several critical areas that deserve deeper consideration.

> JR1. One potential downside of using RF=1 is availability. Since there is no active standby replica, on a failover, the new leader needs to recover the log and rebuild/reload the state before it can serve writes. So, the window of unavailability could be large if the producer state is large. You mentioned that your benchmark showed 1-2s leader failover time. What does the number look like if there are many clients, each with idempotent producer or transactions enabled? This also impacts scalability in the same way.

Failover typically consists of three phases: failure detection, leader switchover, and client awareness. Under RF=3, leader election handles the switchover. Under RF=1, we rely on partition movement (reopening the partition via shared storage), which adds a few extra seconds but, relative to the full failover window, has a small impact on overall latency. That said, this is indeed a trade-off made by KIP-1183: in a cloud environment the probability of failure is much lower than on-premise, so it may be a worthwhile trade-off if the benefits are significant enough. We have not yet encountered large-scale deployments with idempotent producers or transactions enabled, so failover time with large producer state is something we still need to validate.

> JR2 Another potential downside of RF=1 is that it reduces the opportunity for achieving consumer affinity. If a consumer application does operations like joining two topics, having more than 1 read replica enables more opportunities for aligning with the consumers.

Using RF=1 is more about avoiding data replication at the Kafka layer. In fact, object storage makes it easier to scale reads, for example by adding read-only replicas. There are no replication semantics involved; instead, the shared nature of object storage is leveraged to improve fan-out.

> JR3. Most types of block storage seem to be designed for a single zone and don't provide strong durability and availability. So, it's not clear how it can be used with RF=1.

Durability is delegated to the cloud storage service, so it is bounded by what that service provides. Regional durability for block storage is also an important trend; among the top four cloud providers, currently only AWS does not offer it. Additionally, although KIP-1183 does not currently cover the implementation of the Stream layer, in AutoMQ's implementation object storage remains the primary storage (both WAL and data); in some low-latency cases, block or file storage can serve as the WAL implementation to reduce latency.

> JR4. I agree with Satish that it seems there is a lot of work left for the plugin implementer.

The main idea of KIP-1183 is to first establish a relatively abstract storage layer so that the community can iterate on the classic ISR architecture and the shared-storage architecture in parallel. The current storage engine does a lot of state management through local files, which is not well suited to moving directly onto object storage. Plugins always make people uneasy, so could we consider, as a long-term direction, an abstract storage layer with two separate implementations, one for local disk and one for cloud storage? This would also avoid reimplementing a large number of Kafka features, as coordinator-based solutions have to. Although this looks like a relatively high implementation cost, the changes are mostly confined to the storage layer, so the impact on the community's ongoing iteration should be small.
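
To make that shape a bit more concrete, here is a rough sketch of the kind of abstraction I have in mind. The names below (PartitionLog, LocalDiskLog, StreamBackedLog) are invented for this mail and are not the API proposed in the KIP; the only point is that the local-disk path and the shared-storage path sit behind the same interface, with durability handled differently underneath.

    import java.nio.ByteBuffer;

    // Illustrative sketch only; these names are invented for this mail, not the KIP's API.
    interface PartitionLog extends AutoCloseable {
        long append(ByteBuffer records);                 // returns the base offset of the appended batch
        ByteBuffer read(long startOffset, int maxBytes); // serves fetch requests
        void recover();                                  // rebuilds producer/txn state when the log is (re)opened
        void close();                                    // narrowed from AutoCloseable: no checked exception
    }

    // Classic shared-nothing path: segments on local disk, durability via ISR replication (RF=3).
    final class LocalDiskLog implements PartitionLog {
        public long append(ByteBuffer records) { return -1L; }  // stub; would delegate to today's local log
        public ByteBuffer read(long startOffset, int maxBytes) { return ByteBuffer.allocate(0); }
        public void recover() { }                                // local log/segment recovery as today
        public void close() { }
    }

    // Shared-storage path: segments mapped onto the proposed Stream API (S3/HDFS/NFS),
    // durability delegated to the storage service, RF=1 at the Kafka layer.
    final class StreamBackedLog implements PartitionLog {
        public long append(ByteBuffer records) { return -1L; }  // stub; would append to the underlying Stream
        public ByteBuffer read(long startOffset, int maxBytes) { return ByteBuffer.allocate(0); }
        public void recover() { }                                // reload state from shared storage after partition movement
        public void close() { }
    }

With something like this, the classic path keeps behaving exactly as it does today, while the shared-storage path is where a Stream implementation plugs in, so neither side blocks the other's iteration.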
Thank you again for your valuable input, Jun. I really appreciate the depth of your analysis. Wishing you a great day!

Best regards,
Xinyu

On Thu, Aug 7, 2025 at 12:53 AM Jun Rao <[email protected]> wrote:

> Hi, Xinyu,
>
> Thanks for the KIP. A few high level comments.
>
> JR1. One potential downside of using RF=1 is availability. Since there is no active standby replica, on a failover, the new leader needs to recover the log and rebuild/reload the state before it can serve writes. So, the window of unavailability could be large if the producer state is large. You mentioned that your benchmark showed 1-2s leader failover time. What does the number look like if there are many clients, each with idempotent producer or transactions enabled? This also impacts scalability in the same way.
>
> JR2 Another potential downside of RF=1 is that it reduces the opportunity for achieving consumer affinity. If a consumer application does operations like joining two topics, having more than 1 read replica enables more opportunities for aligning with the consumers.
>
> JR3. Most types of block storage seem to be designed for a single zone and don't provide strong durability and availability. So, it's not clear how it can be used with RF=1.
>
> JR4. I agree with Satish that it seems there is a lot of work left for the plugin implementer. For example,
> * fencing logic to prevent an old runaway leader from continuing to write to the shared storage
> * managing the metadata for shared storage
> * merging smaller objects into bigger ones
> * maintaining a read cache
> This makes it almost impossible for anyone to implement a plugin.
>
> Jun
>
> On Thu, May 15, 2025 at 6:37 PM Xinyu Zhou <[email protected]> wrote:
>
> > Hi Colin,
> >
> > Thank you for taking the time to read this KIP, and no worries, negative feedback is a catalyst for improvement.
> >
> > Sorry for the inappropriate description in the Motivation section; my background influenced my writing, but I didn't mean it. I will remove it. Thanks for the reminder.
> >
> > I completely agree with your point on fragmentation risk. I've seen many companies maintain their own Kafka fork branches internally, often focusing on the storage layer. If the storage layer is more scalable, I think it would help reduce fragmentation.
> >
> > On another note, transitioning Kafka from on-premise to cloud is a long-term process, but we can't ignore cloud needs entirely. Therefore, the community may need to support two storage implementations in the foreseeable future, and we should make the storage layer more abstract to support both.
> >
> > Regarding the relationship between KIP-1183 and 1150, and 1176, as mentioned in the KIP, the architecture of 1150 actually conflicts with Kafka's leader-based architecture. As Jun pointed out, transactions and queues rely on leader-based partitions. How 1150 handles current and future features, if they all need to be implemented twice, is a huge burden.
> >
> > For KIP-1176, which I really like, it mainly tries to solve the replication traffic cost issue, but doesn't leverage other advantages of shared storage. We can certainly accept KIP-1176, but what's next?
> > We may still need to discuss how to better support Kafka on cloud storage for elasticity and operational advantages.
> >
> > Regarding NFS, yes, Kafka can run on NFS, but it can't utilize NFS's shared capabilities. For example, data written by Broker A on NFS can't be sensed by Broker B, so even on NFS, reassigning a partition still requires replication.
> >
> > In summary, KIP-1183 aims to discuss how the community views the impact of shared storage on the current architecture. Should we embrace it, and when? So, I think we should at least reach consensus on these two points:
> > 1. We should consider how to support shared storage, but the community needs to support both local disk and shared storage long-term.
> > 2. Which path should we take? The leaderless architecture of 1150 or the approach mentioned in 1183.
> >
> > I will update the KIP with our discussion soon. Thanks again for your time!
> >
> > Best,
> > Xinyu
> >
> > On Fri, May 16, 2025 at 7:33 AM Colin McCabe <[email protected]> wrote:
> >
> > > Hi Xinyu Zhou,
> > >
> > > Thanks for the KIP. It's good to see more people contributing to the community. I think this is your first KIP, so please forgive me for giving some negative feedback.
> > >
> > > KIPs need to be written in a vendor-neutral manner, for the whole community. So please do not do things like begin a paragraph with "At AutoMQ, our goal is..." We really need to focus on the goals of Apache Kafka, not the goals of a specific vendor.
> > >
> > > Similarly, it's probably not a good idea to call out all the specific vendors that have forked Kafka or implemented the Kafka API. We trust that the work people are contributing to AK is Apache licensed and not based on something proprietary, as per our CLA. So we should review the actual proposed design.
> > >
> > > In the KIP-1150 discussion thread, I called out the pluggable APIs that were being proposed as a possible fragmentation risk. I am concerned that the pluggable APIs here could pose an even greater risk. For example, if we end up with a dozen different overlapping AbstractLog implementations, it will be hard to see that as anything but "disunity." It also means that it will be much harder to evolve the core of Kafka.
> > >
> > > After reading this KIP, I'm left confused about what its relationship with KIP-1150 and KIP-1176 are. The text even states "there are no rejected alternatives." But I really disagree with the idea that we can evaluate this proposal without understanding its relationship to alternate proposals. We need to answer the question of why this KIP is necessary if we have KIP-1150 or KIP-1176. After all, those KIPs come with (small) pluggable pieces that allow Kafka to hook into multiple blobstores. (And NFS, of course, doesn't need any plugin at all since it exposes a file-based interface.) So we really need to understand what this KIP brings to the table. That should go in the "rejected alternatives" section.
> > >
> > > Overall, I would encourage you to propose a concrete design rather than a set of plugin APIs. We cannot really evaluate APIs without understanding the implementation.
> > >
> > > best,
> > > Colin
> > >
> > > On Tue, May 13, 2025, at 05:21, Xinyu Zhou wrote:
> > > > Dear Kafka Community,
> > > >
> > > > I am proposing a new KIP to introduce a unified shared storage solution for Kafka, aiming to enhance its scalability and flexibility. This KIP is inspired by the ongoing discussions around KIP-1150 and KIP-1176, which explore leveraging object storage to achieve cost and elasticity benefits. These efforts are commendable, but given the widespread adoption of Kafka's classic shared-nothing architecture, especially in on-premise environments, we need a unified approach that supports a smooth transition from shared-nothing to shared storage. This KIP proposes refactoring the log layer to support both architectures simultaneously, ensuring long-term compatibility and allowing Kafka to fully leverage shared storage services like S3, HDFS, and NFS.
> > > >
> > > > The core of this proposal includes introducing abstract log and log segment classes and a new 'Stream' API to bridge the gap between shared storage services and Kafka's storage layer. This unified solution will enable Kafka to evolve while maintaining backward compatibility, supporting both on-premise and cloud deployments. I believe this approach is crucial for Kafka's continued success and look forward to your thoughts and feedback.
> > > >
> > > > Link to the KIP for more details:
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1183%3A+Unified+Shared+Storage
> > > >
> > > > Best regards,
> > > >
> > > > Xinyu
