Hi, Dong, I think Tom was suggesting to have the AlterTopicsRequest sent to any broker, which just writes the reassignment json to ZK. The controller will pick up the reassignment and act on it as usual. This should work, right?
Having a separate AlterTopicsRequest and AlterReplicaDirRequest seems simpler to me. The former is handled by the controller and the latter is handled by the affected broker. They don't always have to be done together. Merging the two into a single request probably will make both the api and the implementation a bit more complicated. If we do keep the two separate requests, it seems that we should just add AlterReplicaDirRequest to the AdminClient interface? Now, regarding DescribeDirsResponse. I agree that it can be used for the status reporting in KIP-179 as well. However, it seems that reporting the log end offset of each replica may not be easy to use. The log end offset will be returned from different brokers in slightly different time. If there is continuous producing traffic, the difference in log end offset between the leader and the follower could be larger than 0 even if the follower has fully caught up. I am wondering if it's better to instead return the lag in offset per replica. This way, the status can probably be reported more reliably. Thanks, Jun On Tue, Aug 8, 2017 at 11:23 AM, Dong Lin <lindon...@gmail.com> wrote: > Hey Tom, > > Thanks for the quick reply. Please see my comment inline. > > On Tue, Aug 8, 2017 at 11:06 AM, Tom Bentley <t.j.bent...@gmail.com> > wrote: > > > Hi Dong, > > > > Replies inline, as usual > > > > > As I originally envisaged it, KIP-179's support for reassigning > > partitions > > > > > > would have more-or-less taken the logic currently in the > > > > ReassignPartitionsCommand (that is, writing JSON to the > > > > ZkUtils.ReassignPartitionsPath) > > > > and put it behind a suitable network protocol API. Thus it wouldn't > > > matter > > > > which broker received the protocol call: It would be acted on by > > brokers > > > > being notified of the change in the ZK path, just as currently. This > > > would > > > > have kept the ReassignPartitionsCommand relatively simple, as it > > > currently > > > > is. > > > > > > > > > > I am not sure I fully understand your proposal. I think you are saying > > that > > > any broker can receive and handle the AlterTopicRequest. > > > > > > That's right. > > > > > > > Let's say a > > > non-controller broker received AlterTopicRequest, is this broker going > to > > > send LeaderAndIsrRequest to other brokers? Or is this broker create the > > > reassignment znode in zookeper? > > > > > > Exactly: It's going to write some JSON to the relevant znode. Other > brokers > > will get notified by zk when the contents of this znode changes, and do > as > > they do now. This is what the tool/script does now. > > > > I will confess that I don't completely understand the role of > > LeaderAndIsrRequest, since the current code just seems to write to the > > znode do get the brokers to do the reassignment. If you could explain the > > role of LeaderAndIsrRequest that would be great. > > > > Currently only the controller will listen to the reassignment znode and > sends LeaderAndIsrRequest and StopReplicaRequest to brokers in order to > complete reassignment. Brokers won't need to listen to zookeeper for any > reassignment -- brokers only reacts to the request from controller. > Currently Kafka's design replies a lot on the controller to keep a > consistent view of who are the leader of partitions and what is the ISR > etc. It will be a pretty drastic change, if not impossible, for the script > to reassign partitions without going through controller. > > Thus I think it is likely that your AlterTopicsRequest can only be sent to > controller. Then the controller can create the reassignment znode in > zookeeper so that the information is persisted across controller fail over. > I haven't think through this in detail though. > > > > > > > > > > I may have missed it. But I couldn't find > > > the explanation of AlterTopicRequest handling in KIP-179. > > > > > > > You're right, it doesn't go into that much detail. I will fix that. > > > > > > > > > > > > KIP-113 is obviously seeking to make more radical changes. The > > algorithm > > > > described for moving a replica to a particular directory on a > different > > > > broker ( > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > > 113%3A+Support+replicas+movement+between+log+directories#KIP-113: > > > > Supportreplicasmovementbetweenlogdirectories-2) > > > > Howtoreassignreplicabetweenlogdirectoriesacrossbrokers > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > > 113%3A+Support+replicas+movement+between+log+directories#KIP-113: > > > > Supportreplicasmovementbetweenlogdirectories-2% > > > > 29Howtoreassignreplicabetweenlogdirectoriesacrossbrokers>) > > > > involves both sending AlterReplicaDirRequest to "the" broker (the > > > receiving > > > > broker, I assume, but it's not spelled out), _as well as_ writing to > > the > > > ZK > > > > node. > > > > > > > > This assumes the script (ReassignPartitionsCommand) has direct access > > to > > > > ZooKeeper, which is what KIP-179 is seeking to deprecate. It seems a > > > waste > > > > of time to put the logic in the script as part of KIP-113, only for > > > KIP-179 > > > > to have to move it back to the controller. > > > > > > > > > > I am not sure I understand what you mean by "It seems a waste of time > to > > > put the logic in the script as part of KIP-113, only for KIP-179 to > have > > to > > > move it back to the controller". > > > > > > Sorry, I misunderstood slightly what you were proposing in KIP-113, so > the > > "waste of time" comment isn't quite right, but I'm still not convinced > that > > KIP-113+KIP-179 (in its current form) ends with a satisfactory result. > > > > Let me elaborate... KIP-113 says that to support reassigning replica > > between log directories across brokers: > > * ... > > * The script sends AlterReplicaDirRequest to those brokers which need to > > move replicas... > > * The script creates reassignment znode in zookeeper. > > * The script retries AlterReplicaDirRequest to those broker... > > * ... > > > > So the ReassignPartitionsCommand still talks to ZK directly, but now it's > > bracketed by calls to the AdminClient. KIP-179 could replace that talking > > to ZK directly with a new call to the AdminClient. But then we've got a > > pretty weird API, where we have to make three AdminClient calls (two of > > them to the same method), to move a replica. I don't really understand > why > > the admin client can't present a single API method to achieve this, and > > encapsulate on the server side the careful sequence of events necessary > to > > coordinate the movement. I understood this position is what Ismael was > > advocating when he said it was better to put the logic in the controller > > than spread between the script and the controller. But maybe I > > misunderstood him. > > > > I have some concern with putting this logic in controller which can be > found in my previous email. Before that is addressed, the script (or > AdminClient) seems to be the simplest place to have this logic. > > I agree it is better to have a single API to achieve both partition and > replica -> dir assignment. I think it is likely that we will find a good > API to do both. I have updated the KIP-113 to remove API alterReplicaDir > from AdminClient interface. > > > > > > > > > I assume that the logic you mentioned is > > > "movement of replica to the specified log directory". This logic (or > the > > > implementation of this logic) resides mainly in the KafkaAdminClient > and > > > broker. The script only needs to parse the json file as appropriate and > > > call the new API in AdminClient as appropriate. The logic in the script > > is > > > therefore not much and can be easily moved to other classes if needed. > > > > > > Can you clarify why this logic, i.e. movement of replica to the > specified > > > log directory, needs to be moved to controller in KIP-179? I think it > can > > > still be done in the script and controller should not need to worry > about > > > log directory of any replica. > > > > > > Thanks, > > > Dong > > > > > >