Hi Igor,

Thanks for all your work on this! You've really done a lot to push it over the 
finish line. It will be awesome to see this coming to a 3.x release soon.

I did a preliminary pass today. Can I please ask that you give us two weeks 
before you close the vote, since it's such a big change?

I have a few major concerns that are top of mind. Firstly, I'd like to avoid 
creating a big performance impact on clusters that don't use JBOD. Secondly, I 
wanted to dig into the broker/controller interactions here a bit. Finally, 
wanted to talk about meta.properties.

So, let me start by discussing the performance impact on non-JBOD clusters. 
UUIDs are big (8 bytes) and adding 3 of them to the average partition 
assignment will add 24 bytes to PartitionRecord. The size of PartitionRecord is 
critical since a big cluster will have lots and lots of partitions. So I think 
24 extra bytes per partition is a lot of overhead to pay for systems that 
aren't using JBOD.

I think we can avoid this by using a tagged field. Then we can say that if the 
driectory UUID information isn't present, we assume that the partition is in 
the directory corresponding to UUID 0. Then, we can also update the formatting 
tool (and auto-id-generator, etc.) to honor this, setting the first directory 
in the system to UUID 0. Subsequent ones will get a random, non-zero ID, of 
course.

I think this gives us the best of both worlds. There is no performance impact 
on non-JBOD, but for JBOD systems, we have a unique UUID for each directory. It 
also probably makes upgrade easier since we start in a valid state 
automatically (everything in directory 0).

The next thing I wanted to talk about was broker/controller interaction. After 
reading the KIP, it wasn't clear to me whether the controller would be 
responsible for picking a storage directory for new partitions, or whether the 
broker would be. A few times in the doc, you suggest that if the broker has the 
partition in the "wrong" directory, it should move it. That would suggest the 
controller is in control.

If the controller is in control, then we should describe what its policy is for 
picking a storage directory for new replicas. Probably choosing randomly is 
fine as a start.

With controller-side placement, we probably also want a mechanism for the 
broker to be able to ask the controller to stop placing replicas in a certain 
directory, even if it is still working. This would be used if a directory 
starts filling up. I suppose this could be a follow-on KIP as well. However, if 
we don't have this, then we are constraining people to use only roughly equally 
sized directories, and we should note that.

I'd like to understand how directory reassignment will work in this new system. 
In ZK mode, this is done in an entirely broker-side way. The client sends an 
RPC asking the broker to move a replica between directories, and it does so, 
without involving the controller. Are we going to support the same kind of 
system? Or do we want this as well to be controller side?

There are some problems to solve if we want to make inter-directory 
reassignment initiated by the controller, without the broker in the loop. If 
implemented naively, we would violate our durability gauratees.

For example, let's say the client asks the controller to move a replica on 
broker A from directory X to Y. But then before broker learns about this move, 
the broker restarts without directory X (perhaps it has failed, or been 
reconfigured away.) Now the broker is left with just instructions to create the 
replica on directory Y. But it will create an empty replica, which could lead 
to us "going backwards" if the replica was in the ISR.

We can avoid this problem in a few ways. One is to store a "previous log dir 
UUID" in PartitionRecord. But this is kind of awkward because it increases 
metadata size, and the state must be managed. A second way is to have the 
broker perform the move and then ask the controller to "finalize" it in the 
metadata. I think this was your intention with AssignReplicasToDirsRequest, but 
we should spell this out a bit more clearly. If that's the case, then 
AssignReplicasToDirsRequest can only be made by brokers, not be clients.

 A diagram of how a partition move progresses (starting with the client 
initiating it) would help here.

There are a few other smaller comments I had. In PartitionRecord, I would 
prefer a new (tagged) array for replica UUIDs, rather than creating the 
ReplicaAssignment array.

I think the offline log dirs array in BrokerHeartbeatRequest should be tagged 
as well, to conserve space (heartbeats are sent often, and directory failures 
are rare.) Also OfflineLogDirs seems like a better name than LogDirsOfflined 
(and more consistent with the registration request RPC).

The last thing I wanted to mention in this overview email was meta.properties. 
There is some ambiguity with directory.ids because if we can't find one of the 
specified IDs, it could be because of a broker static configuration change that 
added or removed a directory, or a failed directory.

To solve this, maybe we should store a mapping between directory paths and 
UUIDs in meta.properties. For example:
directory.id./my/log/dir=UpZrvl3DRf6QReI74A417w
directory.id./my/log/dir2=JLEIRzVKR7iJ0-JgQm8niA

We would have to have some way of escaping the equals sign if it appeared in a 
path, of course. I'm not sure, the directory.ids issue is a frustrating one.

best,
Colin


On Tue, Jun 13, 2023, at 00:56, ziming deng wrote:
> Hi Igor, Thanks for this work.
>
> +1(binding) from me
>
> --
> Best,
> Ziming
>
>> On Jun 12, 2023, at 18:07, Igor Soarez <soa...@apple.com.INVALID> wrote:
>> 
>> Hi everyone,
>> 
>> We're getting closer to dropping ZooKeeper support, and JBOD
>> in KRaft mode is one of the outstanding big missing features.
>> 
>> It's been a while since there was new feedback on KIP-858 [1]
>> which aims to address this gap, so I'm calling for a vote.
>> 
>> A huge thank you to everyone who has reviewed this KIP, and
>> participated in the discussion thread! [2]
>> 
>> I'd also like to thank you in advance for taking the time to vote.
>> 
>> Best,
>> 
>> --
>> Igor
>> 
>> [1] 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft
>> [2] https://lists.apache.org/thread/8dqvfhzcyy87zyy12837pxx9lgsdhvft
>>

Reply via email to