Hi Igor, Thanks for all your work on this! You've really done a lot to push it over the finish line. It will be awesome to see this coming to a 3.x release soon.
I did a preliminary pass today. Can I please ask that you give us two weeks before you close the vote, since it's such a big change? I have a few major concerns that are top of mind. Firstly, I'd like to avoid creating a big performance impact on clusters that don't use JBOD. Secondly, I wanted to dig into the broker/controller interactions here a bit. Finally, wanted to talk about meta.properties. So, let me start by discussing the performance impact on non-JBOD clusters. UUIDs are big (8 bytes) and adding 3 of them to the average partition assignment will add 24 bytes to PartitionRecord. The size of PartitionRecord is critical since a big cluster will have lots and lots of partitions. So I think 24 extra bytes per partition is a lot of overhead to pay for systems that aren't using JBOD. I think we can avoid this by using a tagged field. Then we can say that if the driectory UUID information isn't present, we assume that the partition is in the directory corresponding to UUID 0. Then, we can also update the formatting tool (and auto-id-generator, etc.) to honor this, setting the first directory in the system to UUID 0. Subsequent ones will get a random, non-zero ID, of course. I think this gives us the best of both worlds. There is no performance impact on non-JBOD, but for JBOD systems, we have a unique UUID for each directory. It also probably makes upgrade easier since we start in a valid state automatically (everything in directory 0). The next thing I wanted to talk about was broker/controller interaction. After reading the KIP, it wasn't clear to me whether the controller would be responsible for picking a storage directory for new partitions, or whether the broker would be. A few times in the doc, you suggest that if the broker has the partition in the "wrong" directory, it should move it. That would suggest the controller is in control. If the controller is in control, then we should describe what its policy is for picking a storage directory for new replicas. Probably choosing randomly is fine as a start. With controller-side placement, we probably also want a mechanism for the broker to be able to ask the controller to stop placing replicas in a certain directory, even if it is still working. This would be used if a directory starts filling up. I suppose this could be a follow-on KIP as well. However, if we don't have this, then we are constraining people to use only roughly equally sized directories, and we should note that. I'd like to understand how directory reassignment will work in this new system. In ZK mode, this is done in an entirely broker-side way. The client sends an RPC asking the broker to move a replica between directories, and it does so, without involving the controller. Are we going to support the same kind of system? Or do we want this as well to be controller side? There are some problems to solve if we want to make inter-directory reassignment initiated by the controller, without the broker in the loop. If implemented naively, we would violate our durability gauratees. For example, let's say the client asks the controller to move a replica on broker A from directory X to Y. But then before broker learns about this move, the broker restarts without directory X (perhaps it has failed, or been reconfigured away.) Now the broker is left with just instructions to create the replica on directory Y. But it will create an empty replica, which could lead to us "going backwards" if the replica was in the ISR. We can avoid this problem in a few ways. One is to store a "previous log dir UUID" in PartitionRecord. But this is kind of awkward because it increases metadata size, and the state must be managed. A second way is to have the broker perform the move and then ask the controller to "finalize" it in the metadata. I think this was your intention with AssignReplicasToDirsRequest, but we should spell this out a bit more clearly. If that's the case, then AssignReplicasToDirsRequest can only be made by brokers, not be clients. A diagram of how a partition move progresses (starting with the client initiating it) would help here. There are a few other smaller comments I had. In PartitionRecord, I would prefer a new (tagged) array for replica UUIDs, rather than creating the ReplicaAssignment array. I think the offline log dirs array in BrokerHeartbeatRequest should be tagged as well, to conserve space (heartbeats are sent often, and directory failures are rare.) Also OfflineLogDirs seems like a better name than LogDirsOfflined (and more consistent with the registration request RPC). The last thing I wanted to mention in this overview email was meta.properties. There is some ambiguity with directory.ids because if we can't find one of the specified IDs, it could be because of a broker static configuration change that added or removed a directory, or a failed directory. To solve this, maybe we should store a mapping between directory paths and UUIDs in meta.properties. For example: directory.id./my/log/dir=UpZrvl3DRf6QReI74A417w directory.id./my/log/dir2=JLEIRzVKR7iJ0-JgQm8niA We would have to have some way of escaping the equals sign if it appeared in a path, of course. I'm not sure, the directory.ids issue is a frustrating one. best, Colin On Tue, Jun 13, 2023, at 00:56, ziming deng wrote: > Hi Igor, Thanks for this work. > > +1(binding) from me > > -- > Best, > Ziming > >> On Jun 12, 2023, at 18:07, Igor Soarez <soa...@apple.com.INVALID> wrote: >> >> Hi everyone, >> >> We're getting closer to dropping ZooKeeper support, and JBOD >> in KRaft mode is one of the outstanding big missing features. >> >> It's been a while since there was new feedback on KIP-858 [1] >> which aims to address this gap, so I'm calling for a vote. >> >> A huge thank you to everyone who has reviewed this KIP, and >> participated in the discussion thread! [2] >> >> I'd also like to thank you in advance for taking the time to vote. >> >> Best, >> >> -- >> Igor >> >> [1] >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft >> [2] https://lists.apache.org/thread/8dqvfhzcyy87zyy12837pxx9lgsdhvft >>