Thanks Kowshik, the answers are making sense. Some follow-ups:

On Tue, Mar 31, 2020 at 6:51 PM Jun Rao <j...@confluent.io> wrote:
> Hi, Kowshik,
>
> Thanks for the KIP. Looks good overall. A few comments below.
>
> 100. UpdateFeaturesRequest/UpdateFeaturesResponse
> 100.1 Since this request waits for responses from brokers, should we add
> a timeout in the request (like CreateTopicsRequest)?
> 100.2 The response schema is a bit weird. Typically, the response just
> shows an error code and an error message, instead of echoing the request.
> 100.3 Should we add a separate request to list/describe the existing
> features?
> 100.4 We are mixing ADD_OR_UPDATE and DELETE in a single request. For
> DELETE, the version field doesn't make sense. So, I guess the broker
> just ignores it? An alternative is to have a separate
> DeleteFeaturesRequest.
> 100.5 In UpdateFeaturesResponse, we have "The monotonically increasing
> version of the metadata for finalized features." I am wondering why the
> ordering is important.
> 100.6 Could you specify the required ACL for this new request?
>
> 101. For the broker registration ZK node, should we bump up the version
> in the JSON?
>
> 102. For the /features ZK node, I am not sure we need the epoch field.
> Each ZK node has an internal version field that is incremented on every
> update.
>
> 103. "Enabling the actual semantics of a feature version cluster-wide is
> left to the discretion of the logic implementing the feature (ex: can be
> done via dynamic broker config)." Does that mean the broker registration
> ZK node will be updated dynamically when this happens?
>
> 104. UpdateMetadataRequest
> 104.1 It would be useful to describe when the feature metadata is
> included in the request. My understanding is that it's only included if
> (1) there is a change to the finalized features; (2) a broker restarts;
> or (3) the controller fails over.
> 104.2 The new fields have the following versions. Why are the versions
> "3+" when the top version is bumped to 6?
>     "fields": [
>         {"name": "Name", "type": "string", "versions": "3+",
>             "about": "The name of the feature."},
>         {"name": "Version", "type": "int64", "versions": "3+",
>             "about": "The finalized version for the feature."}
>     ]
>
> 105. kafka-features.sh: Instead of using update/delete, perhaps it's
> better to use enable/disable?
>
> Jun
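Also, to make 100.1 and 100.2 concrete: following the existing protocol
conventions, I'd expect the request to carry a timeout and the response to
carry just the outcome, roughly like the sketch below (all field names here
are illustrative, not taken from the KIP). For UpdateFeaturesRequest, along
the lines of CreateTopicsRequest:

    "fields": [
        {"name": "TimeoutMs", "type": "int32", "versions": "0+",
            "about": "How long to wait, in milliseconds, before timing out the request."},
        {"name": "FeatureUpdates", "type": "[]FeatureUpdateKey", "versions": "0+",
            "about": "The list of feature updates to apply."}
    ]

and for UpdateFeaturesResponse, just the error code and message:

    "fields": [
        {"name": "ErrorCode", "type": "int16", "versions": "0+",
            "about": "The error code, or 0 if there was no error."},
        {"name": "ErrorMessage", "type": "string", "versions": "0+",
            "about": "The error message, or null if there was no error."}
    ]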
> On Tue, Mar 31, 2020 at 5:29 PM Kowshik Prakasam <kpraka...@confluent.io>
> wrote:
>
> > Hey Boyang,
> >
> > Thanks for the great feedback! I have updated the KIP based on your
> > feedback. Please find my responses below for your comments; look for
> > the sentences starting with "(Kowshik)".
> >
> > > 1. "When is it safe for the brokers to begin handling EOS traffic"
> > > could be converted as "When is it safe for the brokers to start
> > > serving new Exactly-Once (EOS) semantics", since EOS is not explained
> > > earlier in the context.
> >
> > (Kowshik): Great point! Done.
> >
> > > 2. In the *Explanation* section, the metadata version number part
> > > seems a bit blurred. Could you add a reference to the later section
> > > where we describe storing it in ZooKeeper and updating it every time
> > > there is a feature change?
> >
> > (Kowshik): Great point! Done. I've added a reference in the KIP.
> >
> > > 3. For the feature downgrade, although it's a non-goal of the KIP,
> > > for features such as group coordinator semantics, there is no legal
> > > scenario to perform a downgrade at all. So leaving the downgrade door
> > > open is pretty error-prone, as human faults happen all the time. I'm
> > > assuming that as new features are implemented, it's not very hard to
> > > add a flag during feature creation to indicate whether the feature is
> > > "downgradable". Could you explain a bit more about the extra
> > > engineering effort for shipping this KIP with downgrade protection in
> > > place?
> >
> > (Kowshik): Great point! I'd agree and disagree here. While I agree that
> > accidental downgrades can cause problems, I also think downgrades
> > should sometimes be allowed for emergency reasons (not all downgrades
> > cause issues). It just depends on the feature being downgraded.
> >
> > To be more strict about feature version downgrades, I have modified the
> > KIP to propose that we mandate a `--force-downgrade` flag be used in
> > the UPDATE_FEATURES API and the tooling whenever a human is downgrading
> > a finalized feature version. Hopefully this covers the requirement,
> > until we find the need for advanced downgrade support.

+1 for adding this flag.

> > > 4. "Each broker's supported dictionary of feature versions will be
> > > defined in the broker code." So this means that in order to restrict
> > > a certain feature, we need to start the broker first and then send a
> > > feature gating request immediately, which introduces a time gap
> > > during which the intended-to-close feature could actually serve
> > > requests. Do you think we should also support configurations, so that
> > > an admin user could freely roll out a cluster with all nodes
> > > complying with the same feature gating, without worrying about the
> > > turnaround time to propagate the message only after the cluster
> > > starts up?
> >
> > (Kowshik): This is a great point/question. One of the expectations of
> > this KIP, which is already followed in the broker, is the following:
> > - Imagine at time T1 the broker starts up and registers its presence in
> >   ZK, along with advertising its supported features.
> > - Imagine at a future time T2 the broker receives the
> >   UpdateMetadataRequest from the controller, which contains the latest
> >   finalized features as seen by the controller. The broker validates
> >   this data against its supported features to make sure there is no
> >   mismatch (it will shut down if there is an incompatibility).
> >
> > It is expected that during the time between the two events T1 and T2,
> > the broker is almost a silent entity in the cluster. It does not add
> > any value to the cluster or carry out any important broker activities.
> > By "important", I mean it is not doing mutations on its persistence,
> > not mutating critical in-memory state, and won't be serving
> > produce/fetch requests. Note that it doesn't even know its assigned
> > partitions until it receives the UpdateMetadataRequest from the
> > controller. Anything the broker is doing up until this point is neither
> > damaging nor useful.
> >
> > I've clarified the above in the KIP; see this new section:
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features#KIP-584:Versioningschemeforfeatures-Incompatiblebrokerlifetime

My point is that during the bootstrapping stage of a cluster, we couldn't
pick the desired feature version, as no controller is actively handling
requests yet. But anyway, I think this is a rare case to discuss, and the
added paragraph looks good :)
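As a side note, picturing the T1 registration step Kowshik describes
above, the broker registration znode would just grow a supported-features
map next to its existing fields. An abbreviated sketch (names and values
illustrative; other registration fields such as endpoints and rack
omitted, and presumably the top-level "version" would be bumped, per Jun's
101):

    {
        "version": 5,
        "host": "broker0.example.com",
        "port": 9092,
        "features": {
            "group_coordinator": {
                "min_version": 1,
                "max_version": 3
            }
        }
    }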
"adding a new Feature, updating or deleting an existing Feature", > may > > be > > > I misunderstood something, I thought the features are defined in broker > > > code, so admin could not really create a new feature? > > > > (Kowshik): Great point! You understood this right. Here adding a feature > > means we are > > adding a cluster-wide finalized *max* version for a feature that was > > previously never finalized. > > I have clarified this in the KIP now. > > > > > 6. I think we need a separate error code like > FEATURE_UPDATE_IN_PROGRESS > > to > > > reject a concurrent feature update request. > > > > (Kowshik): Great point! I have modified the KIP adding the above (see > > 'Tooling support -> Admin API changes'). > > > > > 7. I think we haven't discussed the alternative solution to pass the > > > feature information through Zookeeper. Is that mentioned in the KIP to > > > justify why using UpdateMetadata is more favorable? > > > > (Kowshik): Nice question! The broker reads finalized feature info stored > in > > ZK, > > only during startup when it does a validation. When serving > > `ApiVersionsRequest`, the > > broker does not read this info from ZK directly. I'd imagine the risk is > > that it can increase > > the ZK read QPS which can be a bottleneck for the system. Today, in Kafka > > we use the > > controller to fan out ZK updates to brokers and we want to stick to that > > pattern to avoid > > the ZK read bottleneck when serving `ApiVersionsRequest`. > > Feature changes should be roughly the same frequency as config changes. Today, the dynamic configuration changes are propagated via Zookeeper. So I guess propagating through UpdateMetadata doesn't get us more benefits, while going through ZK notification should be a simpler solution. > > 8. I was under the impression that user could configure a range of > > > supported versions, what's the trade-off for allowing single finalized > > > version only? > > > > (Kowshik): Great question! The finalized version of a feature basically > > refers to > > the cluster-wide finalized feature "maximum" version. For example, if the > > 'group_coordinator' feature > > has the finalized version set to 10, then, it means that cluster-wide all > > versions upto v10 are > > supported for this feature. However, note that if some version (ex: v0) > > gets deprecated > > for this feature, then we don’t convey that using this scheme (also > > supporting deprecation is a non-goal). > > > > (Kowshik): I’ve now modified the KIP at all points, refering to finalized > > feature "maximum" versions. > > > Understood, I don't feel strong about deprecation, but does the current KIP keep the door open for future improvements if someone has a need for feature deprecation? Could we briefly discuss about it in the future work section? > > > 9. One minor syntax fix: Note that here the "client" here may be a > > producer > > > > (Kowshik): Great point! Done. > > > > > > Cheers, > > Kowshik > > > > > > On Tue, Mar 31, 2020 at 1:17 PM Boyang Chen <reluctanthero...@gmail.com> > > wrote: > > > > > Hey Kowshik, > > > > > > thanks for the revised KIP. Got a couple of questions: > > > > > > 1. "When is it safe for the brokers to begin handling EOS traffic" > could > > be > > > converted as "When is it safe for the brokers to start serving new > > > Exactly-Once(EOS) semantics" since EOS is not explained earlier in the > > > context. > > > > > > 2. In the *Explanation *section, the metadata version number part > seems a > > > bit blurred. 
> > On Tue, Mar 31, 2020 at 1:17 PM Boyang Chen <reluctanthero...@gmail.com>
> > wrote:
> >
> > > Hey Kowshik,
> > >
> > > Thanks for the revised KIP. Got a couple of questions:
> > >
> > > 1. "When is it safe for the brokers to begin handling EOS traffic"
> > > could be converted as "When is it safe for the brokers to start
> > > serving new Exactly-Once (EOS) semantics", since EOS is not explained
> > > earlier in the context.
> > >
> > > 2. In the *Explanation* section, the metadata version number part
> > > seems a bit blurred. Could you add a reference to the later section
> > > where we describe storing it in ZooKeeper and updating it every time
> > > there is a feature change?
> > >
> > > 3. For the feature downgrade, although it's a non-goal of the KIP,
> > > for features such as group coordinator semantics, there is no legal
> > > scenario to perform a downgrade at all. So leaving the downgrade door
> > > open is pretty error-prone, as human faults happen all the time. I'm
> > > assuming that as new features are implemented, it's not very hard to
> > > add a flag during feature creation to indicate whether the feature is
> > > "downgradable". Could you explain a bit more about the extra
> > > engineering effort for shipping this KIP with downgrade protection in
> > > place?
> > >
> > > 4. "Each broker's supported dictionary of feature versions will be
> > > defined in the broker code." So this means that in order to restrict
> > > a certain feature, we need to start the broker first and then send a
> > > feature gating request immediately, which introduces a time gap
> > > during which the intended-to-close feature could actually serve
> > > requests. Do you think we should also support configurations, so that
> > > an admin user could freely roll out a cluster with all nodes
> > > complying with the same feature gating, without worrying about the
> > > turnaround time to propagate the message only after the cluster
> > > starts up?
> > >
> > > 5. "adding a new Feature, updating or deleting an existing Feature":
> > > maybe I misunderstood something, but I thought the features are
> > > defined in broker code, so an admin could not really create a new
> > > feature?
> > >
> > > 6. I think we need a separate error code like
> > > FEATURE_UPDATE_IN_PROGRESS to reject a concurrent feature update
> > > request.
> > >
> > > 7. I think we haven't discussed the alternative solution of passing
> > > the feature information through ZooKeeper. Is that mentioned in the
> > > KIP, to justify why using UpdateMetadata is more favorable?
> > >
> > > 8. I was under the impression that the user could configure a range
> > > of supported versions; what's the trade-off of allowing only a single
> > > finalized version?
> > >
> > > 9. One minor syntax fix: Note that here the "client" here may be a
> > > producer
> > >
> > > Boyang
> > >
> > > On Mon, Mar 30, 2020 at 4:53 PM Colin McCabe <cmcc...@apache.org>
> > > wrote:
> > >
> > > > On Thu, Mar 26, 2020, at 19:24, Kowshik Prakasam wrote:
> > > > > Hi Colin,
> > > > >
> > > > > Thanks for the feedback! I've changed the KIP to address your
> > > > > suggestions. Please find my explanation below. Here is a link to
> > > > > KIP-584:
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features
> > > > >
> > > > > 1. '__data_version__' is the version of the finalized feature
> > > > > metadata (i.e. the actual ZK node contents), while
> > > > > '__schema_version__' is the version of the schema of the data
> > > > > persisted in ZK. These serve different purposes.
> > > > > '__data_version__' is useful mainly to clients during reads, to
> > > > > differentiate between two versions of the eventually consistent
> > > > > 'finalized features' metadata (i.e. a larger metadata version is
> > > > > more recent). '__schema_version__' provides an additional degree
> > > > > of flexibility, where if we decide to change the schema of the
> > > > > '/features' node in ZK (in the future), then we can manage broker
> > > > > rollouts suitably (i.e. serialization/deserialization of the ZK
> > > > > data can be handled safely).
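Putting Kowshik's point 1 together, the '/features' node would then look
roughly like this (field names as discussed above; values illustrative):

    {
        "__schema_version__": 0,
        "__data_version__": 42,
        "features": {
            "group_coordinator": {
                "version": 10
            }
        }
    }

Here '__schema_version__' changes only if the layout of the node itself
changes, while '__data_version__' is bumped on every finalized-feature
update, so readers can tell which of two copies is more recent.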
> > > > Hi Kowshik,
> > > >
> > > > If you're talking about a number that lets you know if data is more
> > > > or less recent, we would typically call that an epoch, and not a
> > > > version. For the ZK data structures, the word "version" is
> > > > typically reserved for describing changes to the overall schema of
> > > > the data that is written to ZooKeeper. We don't even really change
> > > > the "version" of those schemas that much, since most changes are
> > > > backwards-compatible. But we do include that version field just in
> > > > case.
> > > >
> > > > I don't think we really need an epoch here, though, since we can
> > > > just look at the broker epoch. Whenever the broker registers, its
> > > > epoch will be greater than the previous broker epoch. And the newly
> > > > registered data will take priority. This will be a lot simpler than
> > > > adding a separate epoch system, I think.
> > > >
> > > > > 2. Regarding the admin client needing min and max information:
> > > > > you are right! I've changed the KIP such that the Admin API also
> > > > > allows the user to read 'supported features' from a specific
> > > > > broker. Please look at the section "Admin API changes".
> > > >
> > > > Thanks.
> > > >
> > > > > 3. Regarding the use of `long` vs `Long`: it was not deliberate.
> > > > > I've improved the KIP to just use `long` in all places.
> > > >
> > > > Sounds good.
> > > >
> > > > > 4. Regarding the kafka.admin.FeatureCommand tool: you are right!
> > > > > I've updated the KIP, sketching the functionality provided by
> > > > > this tool with some examples. Please look at the section "Tooling
> > > > > support examples".
> > > > >
> > > > > Thank you!
> > > >
> > > > Thanks, Kowshik.
> > > >
> > > > cheers,
> > > > Colin
> > > >
> > > > > Cheers,
> > > > > Kowshik
> > > > >
> > > > > On Wed, Mar 25, 2020 at 11:31 PM Colin McCabe <cmcc...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Thanks, Kowshik, this looks good.
> > > > > >
> > > > > > In the "Schema" section, do we really need both
> > > > > > __schema_version__ and __data_version__? Can we just have a
> > > > > > single version field here?
> > > > > >
> > > > > > Shouldn't the Admin(Client) function have some way to get the
> > > > > > min and max information that we're exposing as well? I guess we
> > > > > > could have min, max, and current. Unrelated: is the use of Long
> > > > > > rather than long deliberate here?
> > > > > >
> > > > > > It would be good to describe how the command line tool
> > > > > > kafka.admin.FeatureCommand will work, for example the flags
> > > > > > that it will take and the output that it will generate to
> > > > > > STDOUT.
> > > > > >
> > > > > > cheers,
> > > > > > Colin
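On the min/max/current point: I'd picture the describe path, whether via
the Admin API or FeatureCommand's STDOUT, returning the broker's supported
range and the cluster-wide finalized ("current") version side by side. A
sketch only, with hypothetical names:

    {
        "feature": "group_coordinator",
        "supported_min_version": 1,
        "supported_max_version": 3,
        "finalized_version": 3
    }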
> > > > > >
> > > > > > On Tue, Mar 24, 2020, at 17:08, Kowshik Prakasam wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > I've opened KIP-584
> > > > > > > <https://issues.apache.org/jira/browse/KIP-584>, which is
> > > > > > > intended to provide a versioning scheme for features. I'd
> > > > > > > like to use this thread to discuss the same. I'd appreciate
> > > > > > > any feedback on this. Here is a link to KIP-584:
> > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-584%3A+Versioning+scheme+for+features
> > > > > > >
> > > > > > > Thank you!
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Kowshik