Hi David,

This looks good. I just have a few comments:

1. I'm not sure it's totally fair to describe the current notification mechanism as "best-effort." At least it guarantees that the controller will eventually see the event. In any case, I think we might want a stronger contract going forward. As long as the broker remains the leader for partitions in offline log directories, it seems like we should retry the AlterReplicaState requests (rough sketch of what I mean below the list).

2. Should we consider a new name for `UNKNOWN_REPLICA_EVENT_TYPE`? Maybe `UNKNOWN_REPLICA_STATE`?

3. Mostly an observation, but there is some overlap between this API and ControlledShutdown. From the controller's perspective, the intent is mostly the same. I guess we could treat a null array in the request as an intent to shut down all replicas if we wanted to try and converge these APIs. One of the differences is that ControlledShutdown is a synchronous API, but I think it would actually have been better as an asynchronous API, since historically we have run into problems with timeouts. Anyway, this is outside the scope of this KIP, but it might be worth mentioning as "Future work" somewhere.
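
For (1), roughly what I have in mind is the loop below. This is just a sketch -- the class, interface, and method names are made up for illustration and aren't taken from the KIP:

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical broker-side notifier: keep resending AlterReplicaState until the
    // controller has acknowledged every replica in the offline log directories.
    class OfflineReplicaNotifier {
        interface ControllerChannel {
            // Returns the partitions the controller successfully marked offline.
            Set<String> alterReplicaState(Set<String> offlinePartitions);
        }

        private final ControllerChannel channel;

        OfflineReplicaNotifier(ControllerChannel channel) {
            this.channel = channel;
        }

        void notifyController(Set<String> offlinePartitions) throws InterruptedException {
            Set<String> pending = new HashSet<>(offlinePartitions);
            // Retry until every offline replica has been acknowledged. A real
            // implementation would presumably also stop once this broker is no
            // longer the leader for any of these partitions.
            while (!pending.isEmpty()) {
                pending.removeAll(channel.alterReplicaState(pending));
                if (!pending.isEmpty())
                    Thread.sleep(1000); // back off before the next AlterReplicaState attempt
            }
        }
    }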

Thanks,
Jason

On Mon, May 18, 2020 at 10:09 AM David Arthur <mum...@gmail.com> wrote:

> I've updated the KIP with the feedback from this discussion:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> I'll send out the vote thread shortly.
>
> Thanks again,
> David
>
> On Tue, May 5, 2020 at 10:34 AM Tom Bentley <tbent...@redhat.com> wrote:
>
> > Hi Colin,
> >
> > Yeah, that makes sense, thanks. I was thinking, longer term, that there
> > are other benefits to having the log dir information available to the
> > controller. For example, it would allow the possibility for CREATE_TOPIC
> > requests to include the intended log dir for each replica. But that's
> > obviously completely out of scope for this KIP.
> >
> > Kind regards,
> >
> > Tom
> >
> > On Mon, May 4, 2020 at 10:11 PM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > Hi Tom,
> > >
> > > As you said, the controller doesn't know about log directories,
> > > although individual brokers do. So the brokers do currently have to
> > > explicitly enumerate for the controller all the partitions that need
> > > to be removed. So this KIP isn't changing anything in that regard.
> > >
> > > The current flow is:
> > > 1. ping the ZK back-channel
> > > 2. the controller sends a full LeaderAndIsrRequest to the broker
> > > 3. the broker sends a full response containing error codes for all
> > > partitions. Partitions on the failed storage have a nonzero error
> > > code; the others have 0.
> > >
> > > The new flow is:
> > > 1. the broker sends an RPC with all the failed partitions
> > >
> > > So the new flow actually substantially reduces the amount of network
> > > traffic, since previously we sent a full LeaderAndIsrRequest containing
> > > all of the partitions. Now we just send the partitions in the failed
> > > storage directory. That could still be a lot, but it would only be a
> > > fraction of what a full LeaderAndIsrRequest would contain.
> > >
> > > Sorry if I'm repeating stuff you already figured out, but I just wanted
> > > to be more clear about this (I found it confusing too until David
> > > explained it to me originally...)
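> > >
> > > To make the new flow a bit more concrete, the broker side could look
> > > roughly like the sketch below. Hypothetical names only -- this isn't
> > > the actual code or schema from the KIP:
> > >
> > >     import java.util.List;
> > >     import java.util.Map;
> > >     import java.util.stream.Collectors;
> > >
> > >     // Sketch: report only the replicas living in the failed log directory,
> > >     // instead of the controller pushing a full LeaderAndIsrRequest that
> > >     // covers every partition on the broker.
> > >     class LogDirFailureReporter {
> > >         // Key: partition name (e.g. "foo-0"); value: the log dir hosting it.
> > >         static List<String> partitionsToReport(Map<String, String> partitionToLogDir,
> > >                                                String failedLogDir) {
> > >             return partitionToLogDir.entrySet().stream()
> > >                     .filter(e -> e.getValue().equals(failedLogDir))
> > >                     .map(Map.Entry::getKey)
> > >                     .collect(Collectors.toList());
> > >         }
> > >     }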
> > >
> > > best,
> > > Colin
> > >
> > > On Sat, May 2, 2020, at 10:30, Tom Bentley wrote:
> > > > Hi David,
> > > >
> > > > > > In rejecting the alternative of having an RPC for log dir
> > > > > > failures you say
> > > > >
> > > > > I guess what I really mean here is that I wanted to avoid exposing
> > > > > the notion of a log dir to the controller. I can update the
> > > > > description to reflect this.
> > > >
> > > > Ah, I think I see now. While each broker knows about its log dirs,
> > > > this isn't something that's stored in ZooKeeper or known to the
> > > > controller.
> > > >
> > > > > > It's also not completely clear that the cost of having to
> > > > > > enumerate all the partitions on a log dir was weighed against
> > > > > > the perceived benefit of a more flexible RPC.
> > > > >
> > > > > The enumeration isn't strictly required. In the "RPC semantics"
> > > > > section, I mention that if no topics are present in the RPC
> > > > > request, then all topics on the broker are implied. And if a topic
> > > > > is given with no partitions, all partitions for that topic (on the
> > > > > broker) are implied. Does this make sense?
> > > >
> > > > So the no-topics-present optimisation wouldn't be available to a
> > > > broker with >1 log dirs where only some of the log dirs failed. I
> > > > don't suppose that's a problem though.
> > > >
> > > > Thanks again,
> > > >
> > > > Tom
> > > >
> > > > On Fri, May 1, 2020 at 5:48 PM David Arthur <mum...@gmail.com> wrote:
> > > >
> > > > > Jose/Colin/Tom, thanks for the feedback!
> > > > >
> > > > > > Partition level errors
> > > > >
> > > > > This was an oversight on my part; I meant to include these in the
> > > > > response RPC. I'll update that.
> > > > >
> > > > > > INVALID_REQUEST
> > > > >
> > > > > I'll update this text description; that was a copy/paste leftover.
> > > > >
> > > > > > I think we should mention that the controller will keep its
> > > > > > current implementation of marking the replicas as offline because
> > > > > > of failure in the LeaderAndIsr response.
> > > > >
> > > > > Good suggestions, I'll add that.
> > > > >
> > > > > > Does EventType need to be an Int32?
> > > > >
> > > > > No, it doesn't. I'll update it to Int8. Do we have an example of
> > > > > the enum paradigm in our RPC today? I'm curious whether we actually
> > > > > map it to a real Java enum in the AbstractRequest/Response classes.
> > > > >
> > > > > > AlterReplicaStates
> > > > >
> > > > > Sounds good to me.
> > > > >
> > > > > > In rejecting the alternative of having an RPC for log dir
> > > > > > failures you say
> > > > >
> > > > > I guess what I really mean here is that I wanted to avoid exposing
> > > > > the notion of a log dir to the controller. I can update the
> > > > > description to reflect this.
> > > > >
> > > > > > It's also not completely clear that the cost of having to
> > > > > > enumerate all the partitions on a log dir was weighed against
> > > > > > the perceived benefit of a more flexible RPC.
> > > > >
> > > > > The enumeration isn't strictly required. In the "RPC semantics"
> > > > > section, I mention that if no topics are present in the RPC
> > > > > request, then all topics on the broker are implied. And if a topic
> > > > > is given with no partitions, all partitions for that topic (on the
> > > > > broker) are implied. Does this make sense?
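> > > > >
> > > > > As a rough illustration of those semantics (hypothetical names, not
> > > > > code from the KIP), the controller-side expansion would be something
> > > > > like:
> > > > >
> > > > >     import java.util.ArrayList;
> > > > >     import java.util.List;
> > > > >     import java.util.Map;
> > > > >
> > > > >     class ImpliedPartitionResolver {
> > > > >         // requested: topics/partitions named in the request;
> > > > >         // hosted: everything the sending broker hosts, per topic.
> > > > >         static List<String> expand(Map<String, List<Integer>> requested,
> > > > >                                    Map<String, List<Integer>> hosted) {
> > > > >             // No topics in the request implies every topic on the broker.
> > > > >             Map<String, List<Integer>> topics = requested.isEmpty() ? hosted : requested;
> > > > >             List<String> result = new ArrayList<>();
> > > > >             for (Map.Entry<String, List<Integer>> e : topics.entrySet()) {
> > > > >                 // A topic with no partitions listed implies all of its
> > > > >                 // partitions on the broker.
> > > > >                 List<Integer> partitions = e.getValue().isEmpty()
> > > > >                         ? hosted.getOrDefault(e.getKey(), new ArrayList<>())
> > > > >                         : e.getValue();
> > > > >                 for (Integer p : partitions)
> > > > >                     result.add(e.getKey() + "-" + p);
> > > > >             }
> > > > >             return result;
> > > > >         }
> > > > >     }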
> > > > >
> > > > > Thanks again! I'll update the KIP and leave a message here once it's
> > > > > revised.
> > > > >
> > > > > David
> > > > >
> > > > > On Wed, Apr 29, 2020 at 11:20 AM Tom Bentley <tbent...@redhat.com>
> > > > > wrote:
> > > > >
> > > > > > Hi David,
> > > > > >
> > > > > > Thanks for the KIP!
> > > > > >
> > > > > > In rejecting the alternative of having an RPC for log dir failures
> > > > > > you say:
> > > > > >
> > > > > > > It was also rejected to prevent "leaking" the notion of a log dir
> > > > > > > to the public API.
> > > > > >
> > > > > > I'm not quite sure I follow that argument, since we already have
> > > > > > RPCs for changing replica log dirs. So in a general sense log dirs
> > > > > > already exist in the API. I suspect you were using "public API" to
> > > > > > mean something more specific; could you elaborate?
> > > > > >
> > > > > > It's also not completely clear that the cost of having to enumerate
> > > > > > all the partitions on a log dir was weighed against the perceived
> > > > > > benefit of a more flexible RPC. (I'm sure it was, but it would be
> > > > > > good to say so.)
> > > > > >
> > > > > > Many thanks,
> > > > > >
> > > > > > Tom
> > > > > >
> > > > > > On Wed, Apr 29, 2020 at 12:04 AM Colin McCabe <cmcc...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi David,
> > > > > > >
> > > > > > > Thanks for the KIP!
> > > > > > >
> > > > > > > I think the ReplicaStateEventResponse should have a separate error
> > > > > > > code for each partition. Currently it just has one error code for
> > > > > > > the whole request/response, if I'm reading this right. I think
> > > > > > > Jose made a similar point as well. We should plan for scenarios
> > > > > > > where some replica states can be changed and some can't.
> > > > > > >
> > > > > > > Does EventType need to be an Int32? For enums, we usually use the
> > > > > > > smallest reasonable type, which would be Int8 here. We can always
> > > > > > > change the schema later if needed. UNKNOWN_REPLICA_EVENT_TYPE
> > > > > > > seems unnecessary since INVALID_REQUEST covers this case.
> > > > > > >
> > > > > > > I'd also suggest "AlterReplicaStates[Request,Response]" as a
> > > > > > > slightly better name for this RPC.
> > > > > > >
> > > > > > > cheers,
> > > > > > > Colin
> > > > > > >
> > > > > > > On Tue, Apr 7, 2020, at 12:43, David Arthur wrote:
> > > > > > > > Hey everyone,
> > > > > > > >
> > > > > > > > I'd like to start the discussion for KIP-589, part of the
> > > > > > > > KIP-500 effort:
> > > > > > > >
> > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller
> > > > > > > >
> > > > > > > > This is a proposal to use a new RPC instead of ZooKeeper for
> > > > > > > > notifying the controller of an offline replica. Please give it
> > > > > > > > a read and let me know your thoughts.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > > David
> > > > >
> > > > > --
> > > > > David Arthur
>
> --
> David Arthur