Hi Tom, A high-level point for discussion before going into the details. The proposed protocol API `alterTopics` has 2 types of operations:
1. Operations that cause data movement (or deletion): increase/decrease of replication factor and partition reassignment. These are currently done by `kafka-reassign-partitions.sh` and having a way to check the progress is useful. 2. Operations that don't involve data movement: increase the number of partitions and update topic configs (this is not in the proposal, but could be). These are currently done by `kafka-topics.sh` and they don't usually take very long so progress reporting is less important. It would be worth thinking whether having 2 separate protocol APIs would be better. I can see pros and cons, so I'd be interested in what you and others think. Ismael On Tue, Aug 1, 2017 at 10:19 AM, Tom Bentley <t.j.bent...@gmail.com> wrote: > I have added the timeout I mentioned before, because it makes the > implementation of topic alteration more symmetric with the topic creation > APIs. > > I have also added a section ("Policy") on retrofitting the > CreateTopicPolicy's rules to topic alteration requests, and made a few > other minor fixes. > > I've still not factored out the topic name from the ReplicaStatusRequest > (Ismael, do you have an opinion about that?) > > If any one else has some feedback on this KIP that would be great. > Otherwise, I would like to start a vote on this KIP before the end of the > week. > > Thanks, > > Tom > > On 26 July 2017 at 11:45, Tom Bentley <t.j.bent...@gmail.com> wrote: > > > I've updated the KIP to fix those niggles, but I've not factored out the > > topic name from the ReplicaStatusRequest, yet. > > > > Looking at the topic creation APIs in more detail, the > CreateTopicsOptions > > has > > > > * `shouldValidateOnly()`, which would make a lot of sense for the alter > > topic APIs > > * `timeoutMs()`, which I'm not sure sure about... > > > > Topic creation doesn't require shifting replicas between brokers so it's > > reasonable to support timeout, because we don't expect it to take very > long. > > > > Topic alteration usually takes a while because we are going to have to > > move replicas. Since we're adding a whole API to track the progress of > that > > replication, I'm inclined to think that having a timeout is a bit > pointless. > > > > But should the replicaStatus() API have a timeout? I suppose it probably > > should. > > > > > > On 26 July 2017 at 10:58, Tom Bentley <t.j.bent...@gmail.com> wrote: > > > >> Thanks Paolo, > >> > >> * in the "Public Interfaces" section you wrote > >>> alterTopics(Set<AlterTopics>) but then a collection is used (instead > of a > >>> set) in the Proposed Changes section. I'm ok with collection. > >>> > >> > >> Agree it should be Collection. > >> > >> > >>> * in the summary of the alterTopics method you say "The request can > >>> change the number of partitions, replication factor and/or the > partition > >>> assignments." I think that the "and/or" is misleading (at least for > me). > >>> For the TopicCommand tool you can specify partitions AND replication > factor > >>> ... OR partition assignments (but not for example partitions AND > >>> replication factor AND partition assignments). Maybe it could be "The > >>> request can change the number of partitions and the related replication > >>> factor or specifying a partition assignments." > >>> > >> > >> Is there a reason why we can't support changing all three at once? It > >> certainly makes conceptual sense to, say, increase the number of > partitions > >> and replication factor, and specify how you want the partitions > assigned. > >> And doing two separate calls would be less efficient as we sync new > >> replicas on brokers only to then move them somewhere else. > >> > >> If there is a reason we don't want to support changing all three, then > we > >> can return the error INVALID_REQUEST (42). That would allow us to > support > >> changing all three at some time in the future, without having to change > the > >> API. > >> > >> > >>> * I know that it would be a breaking change in the Admin Client API > >>> but why having an AlteredTopic class which is quite similar to the > already > >>> existing NewTopic class ? I know that using NewTopic for the alter > method > >>> could be misleading. What about renaming NewTopic in something like > >>> AdminTopic ? At same time it means that the TopicDetails class (which > you > >>> can get from the current NewTopic) should be outside the > >>> CreateTopicsRequest because it could be used in the AlterTopicsRequest > as > >>> well. > >>> > >> > >> One problem with this is it tends to inhibit future API changes for > >> either newTopics() or alterTopics(), because any common class needs to > make > >> sense in both contexts. > >> > >> For createTopics() we get to specify some configs (the > >> Map<String,String>), but since the AdminClient already has > alterConfigs() > >> for changing topic configs after topic creation I don't think it's > right to > >> also support changing those configs via alterTopics() as well. But > having > >> them in a common AdminTopic class would suggest that that was supported. > >> Yes, alterTopics could return INVALID_REQUEST if it was given topic > >> configs, but this is just making the API harder to use since it is > >> weakening of the type safety of the API. > >> > >> I suppose we *could* write a common TopicDetails class and make the > >> existing nested one extend the common one, with deprecations, to > eventually > >> remove the nested one. > >> > >> > >> > >>> * A typo in the ReplicaStatus : gpartition() instead of partition() > >>> * In the AlterTopicRequets > >>> * the replication factor should be INT16 > >>> > >> > >> Ah, thanks! > >> > >> > >>> * I would use same fields name as CreateTopicsRequest (they are > >>> quite similar) > >>> * What's the broker id in the ReplicaStatusRequest ? > >>> > >> > >> It's the broker, which is expected to have a replica of the given > >> partition, that we're querying the status of. It is necessary because > we're > >> asking the _leader_ for the partition about (a subset of) the status of > the > >> followers. Put another way, to identify the replica of a particular > >> partition on a particular broker we need the tuple (topic, partition, > >> broker). > >> > >> If we were always interested in the status of the partition across all > >> brokers we could omit the broker part. But for reassignment we actually > >> only care about a subset of the brokers. > >> > >> > >>> * Thinking aloud, could make sense having "Partition" in the > >>> ReplicaStatusRequest as an array so that I can specify in only one > request > >>> the status for partitions I'm interested in, in order to avoid e > request > >>> for each partition for the same topic. Maybe empty array could mean .. > >>> "give me status for all partitions of this topic". Of course it means > that > >>> the ReplicaStatusResponse should change because we should have an array > >>> with partition, broker, lag and so on > >>> > >> > >> You already can specify in one request the status for all the partitions > >> you're interested in (ReplicaStatus can be repeated/is an array field). > >> > >> We could factor out the topic to avoid repeating it, which would be more > >> efficient when we're querying the status of many partitions of a topic > >> and/or there are many brokers holding replicas. In other words, we could > >> factor it to look like this: > >> > >> ReplicaStatusRequest => [TopicReplicas] > >> TopicReplicas => Topic [PartitionReplica] > >> Topic => string > >> PartitionReplica => Partition Broker > >> Partition => int32 > >> Broker => int32 > >> > >> Does this make sense to you? > >> > >> > > >