Hey Jun, Thanks for the comment!
Yes, it should work. The tool can send request to any broker and broker can just write the reassignment znode. My previous intuition is that it may be better to only send this request to controller. But I don't have good reasons for this restriction. My intuition is that we can keep them separate as well. Becket and I have discussed this both offline and in https://github.com/apache/kafka/pull/3621. Currently I don't have a strong opinion on this and I am open to using only one API to do both if someone can come up with a reasonable API signature for this method. For now I have added the method alterReplicaDir() in KafkaAdminClient instead of the AdminClient interface so that the reassignment script can use this method without concluding what the API would look like in AdminClient in the future. Regarding DescribeDirsResponse, I think it is probably OK to have slightly more lag. The script can calculate the lag of the follower replica as Math.max(0, leaderLEO - followerLEO). I agree that it will be slightly less accurate than the current approach in KIP-179. But even with the current approach in KIP-179, the result provided by the script is an approximation anyway, since there is delay from the time that leader returns response to the time that the script collects response from all brokers and prints result to user. I think if the slight difference in the accuracy between the two approaches does not make a difference to the intended use-case of this API, then we probably want to re-use the exiting request/response to keep the protocol simple. Thanks, Dong On Tue, Aug 8, 2017 at 5:56 PM, Jun Rao <j...@confluent.io> wrote: > Hi, Dong, > > I think Tom was suggesting to have the AlterTopicsRequest sent to any > broker, which just writes the reassignment json to ZK. The controller will > pick up the reassignment and act on it as usual. This should work, right? > > Having a separate AlterTopicsRequest and AlterReplicaDirRequest seems > simpler to me. The former is handled by the controller and the latter is > handled by the affected broker. They don't always have to be done together. > Merging the two into a single request probably will make both the api and > the implementation a bit more complicated. If we do keep the two separate > requests, it seems that we should just add AlterReplicaDirRequest to the > AdminClient interface? > > Now, regarding DescribeDirsResponse. I agree that it can be used for the > status reporting in KIP-179 as well. However, it seems that reporting the > log end offset of each replica may not be easy to use. The log end offset > will be returned from different brokers in slightly different time. If > there is continuous producing traffic, the difference in log end offset > between the leader and the follower could be larger than 0 even if the > follower has fully caught up. I am wondering if it's better to instead > return the lag in offset per replica. This way, the status can probably be > reported more reliably. > > Thanks, > > Jun > > On Tue, Aug 8, 2017 at 11:23 AM, Dong Lin <lindon...@gmail.com> wrote: > > > Hey Tom, > > > > Thanks for the quick reply. Please see my comment inline. > > > > On Tue, Aug 8, 2017 at 11:06 AM, Tom Bentley <t.j.bent...@gmail.com> > > wrote: > > > > > Hi Dong, > > > > > > Replies inline, as usual > > > > > > > As I originally envisaged it, KIP-179's support for reassigning > > > partitions > > > > > > > > would have more-or-less taken the logic currently in the > > > > > ReassignPartitionsCommand (that is, writing JSON to the > > > > > ZkUtils.ReassignPartitionsPath) > > > > > and put it behind a suitable network protocol API. Thus it wouldn't > > > > matter > > > > > which broker received the protocol call: It would be acted on by > > > brokers > > > > > being notified of the change in the ZK path, just as currently. > This > > > > would > > > > > have kept the ReassignPartitionsCommand relatively simple, as it > > > > currently > > > > > is. > > > > > > > > > > > > > I am not sure I fully understand your proposal. I think you are > saying > > > that > > > > any broker can receive and handle the AlterTopicRequest. > > > > > > > > > That's right. > > > > > > > > > > Let's say a > > > > non-controller broker received AlterTopicRequest, is this broker > going > > to > > > > send LeaderAndIsrRequest to other brokers? Or is this broker create > the > > > > reassignment znode in zookeper? > > > > > > > > > Exactly: It's going to write some JSON to the relevant znode. Other > > brokers > > > will get notified by zk when the contents of this znode changes, and do > > as > > > they do now. This is what the tool/script does now. > > > > > > I will confess that I don't completely understand the role of > > > LeaderAndIsrRequest, since the current code just seems to write to the > > > znode do get the brokers to do the reassignment. If you could explain > the > > > role of LeaderAndIsrRequest that would be great. > > > > > > > Currently only the controller will listen to the reassignment znode and > > sends LeaderAndIsrRequest and StopReplicaRequest to brokers in order to > > complete reassignment. Brokers won't need to listen to zookeeper for any > > reassignment -- brokers only reacts to the request from controller. > > Currently Kafka's design replies a lot on the controller to keep a > > consistent view of who are the leader of partitions and what is the ISR > > etc. It will be a pretty drastic change, if not impossible, for the > script > > to reassign partitions without going through controller. > > > > Thus I think it is likely that your AlterTopicsRequest can only be sent > to > > controller. Then the controller can create the reassignment znode in > > zookeeper so that the information is persisted across controller fail > over. > > I haven't think through this in detail though. > > > > > > > > > > > > > > > > I may have missed it. But I couldn't find > > > > the explanation of AlterTopicRequest handling in KIP-179. > > > > > > > > > > You're right, it doesn't go into that much detail. I will fix that. > > > > > > > > > > > > > > > > KIP-113 is obviously seeking to make more radical changes. The > > > algorithm > > > > > described for moving a replica to a particular directory on a > > different > > > > > broker ( > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > > > 113%3A+Support+replicas+movement+between+log+directories#KIP-113: > > > > > Supportreplicasmovementbetweenlogdirectories-2) > > > > > Howtoreassignreplicabetweenlogdirectoriesacrossbrokers > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP- > > > > > 113%3A+Support+replicas+movement+between+log+directories#KIP-113: > > > > > Supportreplicasmovementbetweenlogdirectories-2% > > > > > 29Howtoreassignreplicabetweenlogdirectoriesacrossbrokers>) > > > > > involves both sending AlterReplicaDirRequest to "the" broker (the > > > > receiving > > > > > broker, I assume, but it's not spelled out), _as well as_ writing > to > > > the > > > > ZK > > > > > node. > > > > > > > > > > This assumes the script (ReassignPartitionsCommand) has direct > access > > > to > > > > > ZooKeeper, which is what KIP-179 is seeking to deprecate. It seems > a > > > > waste > > > > > of time to put the logic in the script as part of KIP-113, only for > > > > KIP-179 > > > > > to have to move it back to the controller. > > > > > > > > > > > > > I am not sure I understand what you mean by "It seems a waste of time > > to > > > > put the logic in the script as part of KIP-113, only for KIP-179 to > > have > > > to > > > > move it back to the controller". > > > > > > > > > Sorry, I misunderstood slightly what you were proposing in KIP-113, so > > the > > > "waste of time" comment isn't quite right, but I'm still not convinced > > that > > > KIP-113+KIP-179 (in its current form) ends with a satisfactory result. > > > > > > Let me elaborate... KIP-113 says that to support reassigning replica > > > between log directories across brokers: > > > * ... > > > * The script sends AlterReplicaDirRequest to those brokers which need > to > > > move replicas... > > > * The script creates reassignment znode in zookeeper. > > > * The script retries AlterReplicaDirRequest to those broker... > > > * ... > > > > > > So the ReassignPartitionsCommand still talks to ZK directly, but now > it's > > > bracketed by calls to the AdminClient. KIP-179 could replace that > talking > > > to ZK directly with a new call to the AdminClient. But then we've got a > > > pretty weird API, where we have to make three AdminClient calls (two of > > > them to the same method), to move a replica. I don't really understand > > why > > > the admin client can't present a single API method to achieve this, and > > > encapsulate on the server side the careful sequence of events necessary > > to > > > coordinate the movement. I understood this position is what Ismael was > > > advocating when he said it was better to put the logic in the > controller > > > than spread between the script and the controller. But maybe I > > > misunderstood him. > > > > > > > I have some concern with putting this logic in controller which can be > > found in my previous email. Before that is addressed, the script (or > > AdminClient) seems to be the simplest place to have this logic. > > > > I agree it is better to have a single API to achieve both partition and > > replica -> dir assignment. I think it is likely that we will find a good > > API to do both. I have updated the KIP-113 to remove API alterReplicaDir > > from AdminClient interface. > > > > > > > > > > > > > > I assume that the logic you mentioned is > > > > "movement of replica to the specified log directory". This logic (or > > the > > > > implementation of this logic) resides mainly in the KafkaAdminClient > > and > > > > broker. The script only needs to parse the json file as appropriate > and > > > > call the new API in AdminClient as appropriate. The logic in the > script > > > is > > > > therefore not much and can be easily moved to other classes if > needed. > > > > > > > > Can you clarify why this logic, i.e. movement of replica to the > > specified > > > > log directory, needs to be moved to controller in KIP-179? I think it > > can > > > > still be done in the script and controller should not need to worry > > about > > > > log directory of any replica. > > > > > > > > Thanks, > > > > Dong > > > > > > > > > >