Hi Boyang, Thanks for the responses. Follow up comments on a couple of those:
4) Can you provide some more details on the scenarios where file watcher is useful. You mentioned hybrid, but it is not clear to me why a watcher that reloads 99% of the time would be useful. There are a few cases to consider: 4a) Refresh SSL stores prior to expiry. In this case, time-interval-based reloading makes sense and this is similar to refresh of other security credentials. It seems reasonable to expect that stores are updated well before expiry and that the refresh interval would be configured accordingly. 4b) Remove a compromised client certificate from the trust store. This is a critical case. With the current code, you send an admin client request and get feedback on whether the update happened. A watcher that may or may not work is pretty much useless. The only reliable way to handle this case after this KIP would be to change the file name and send an admin client request. 4c) Change broker key store, which also perhaps requires change to the truststore. This is the ordering issue introduced by this KIP. This is also the issue that David Jacot brought up. With the current code, you update keystore and truststore and send admin client requests to perform the update and hence you have total control on when each update is performed and whether they are performed together. With the proposed KIP, we lose this control. Like 4b), the only reliable way after this KIP to handle this case would be to change filenames and send an admin client request. 4d) Add a new client certificate to the broker truststore. This is the only case where a file watcher seems useful to reload sooner. To summarize, two of the four cases are no longer reliable without file name change and it would be better to document these as limitations. These are rare cases where file name change seems reasonable. Periodic refresh, which is the most common case and one that we would want to support in clients some day can be supported without a file watcher. That leaves 4d) which I am not sure justifies the use of a file watcher that cannot be disabled. 5) Dynamic updates are restricted to ensure brokers don't become unusable after an update. For example, we don't allow the distinguished name (DN) in the certificate that is used to determine user identity to be changed dynamically. Similarly, you cannot remove host names included in the certificates since that may fail client validation. For certificates used for inter-broker communication, we validate the key store against the trust store to ensure that brokers are able to communicate to each other. With admin client based updates, we attempt an update and provide feedback so that the user can take action. The KIP needs more detail on how errors will be handled with the watcher or periodic updates. It also needs more detail on the new requirements being introduced for ordering of updates of files on the file system or clearly state the cases which will no longer be supported without file name change. On Thu, Jan 7, 2021 at 6:46 PM Boyang Chen <reluctanthero...@gmail.com> wrote: > Hey David, thanks for the feedback. > > On Thu, Jan 7, 2021 at 2:37 AM David Jacot <dja...@confluent.io> wrote: > > > Hi Boyang, > > > > Thanks for the KIP. I am fine with it in general. I just have a few > > comments. > > > > With the proposal, we don't have the guarantee that both the new keystore > > and the new truststore will be picked up together so we may end up with > > the new keystore and the old truststore for a short period of time, or > > permanently > > if the second one can't be reloaded for any reason. > > > > This could disallow clients to authenticate for a while if the new > keystore > > and the > > new trustore are not crafted to work with their old versions. > > > > I wonder how this would work in practice. Do we already have guards in > > place to avoid this or could we add something to ensure that listeners > are > > updated only if both the truststore and the keystore works with each > other? > > > > We don't have this issue today as both the truststore and the keystore > are > > reloaded when the AlterConfig RPC is received so the admin can control > > this process. It is all or nothing. > > > > I'm not sure how we achieve that today, if we are updating both key store > and trust store > in the same request, it is still possible that one update succeeded but > another failed. Does users > have awareness to make the guarantee by themselves? My > point is that today there is no reliable way to achieve atomic update > either, and hoping users > to make the correct sequence of decisions would be hard. > > I think that this is acceptable but it is worth clearly mentioning that > > there is no > > guarantee from that regard in the KIP, and later in the doc. Perhaps, we > > could > > also mention that updating them in place is not a best practice and that > > using > > new paths gives better control to the admin. > > > > Best, > > David > > > > On Wed, Jan 6, 2021 at 6:55 PM Jason Gustafson <ja...@confluent.io> > wrote: > > > > > Thanks Boyang. Someone mentioned my email never showed up, but > basically > > I > > > suggested tying the refresh configuration more directly to the > > > configurations it would affect. I'm happy with the updates. > > > > > > -Jason > > > > > > On Tue, Jan 5, 2021 at 8:34 PM Boyang Chen <reluctanthero...@gmail.com > > > > > wrote: > > > > > > > Thanks Jason for the feedback. I separated the time configs for key > > store > > > > and trust store, and rename the configs as you proposed. > > > > > > > > Best, > > > > Boyang > > > > > > > > On Mon, Dec 14, 2020 at 3:47 PM Boyang Chen < > > reluctanthero...@gmail.com> > > > > wrote: > > > > > > > > > Hey there, > > > > > > > > > > bumping up this thread to see if there are further questions > > regarding > > > > the > > > > > updated proposal. > > > > > > > > > > Best, > > > > > Boyang > > > > > > > > > > On Thu, Dec 10, 2020 at 11:52 AM Boyang Chen < > > > reluctanthero...@gmail.com > > > > > > > > > > wrote: > > > > > > > > > >> After some offline discussions, we believe that it's the right > > > direction > > > > >> to go by doing a hybrid approach which includes both file-watch > > > trigger > > > > and > > > > >> interval based reloading. The former guarantees a swift change in > > 99% > > > > time, > > > > >> while the latter provides a time-based guarantee in the worst case > > > when > > > > the > > > > >> file-watch does not take effect. The current default reloading > > > interval > > > > is > > > > >> set to 5 min. I have updated the KIP and ticket, feel free to > check > > > out > > > > and > > > > >> see if it makes sense. > > > > >> > > > > >> Best, > > > > >> Boyang > > > > >> > > > > >> On Tue, Dec 8, 2020 at 8:58 PM Boyang Chen < > > > reluctanthero...@gmail.com> > > > > >> wrote: > > > > >> > > > > >>> Hey Gwen, thanks for the feedback. > > > > >>> > > > > >>> On Sun, Dec 6, 2020 at 10:06 PM Gwen Shapira <g...@confluent.io> > > > > wrote: > > > > >>> > > > > >>>> Agree with Igor. IIRC, we also encountered cases where filewatch > > was > > > > >>>> not triggered as expected. An interval will give us a better > > > > >>>> worse-case scenario that is easily controlled by the Kafka > admin. > > > > >>>> > > > > >>>> Are the cases you were referring to happening in the cloud > > > > environment? > > > > >>> Should we investigate instead of simply assuming the standard API > > > won't > > > > >>> work? I checked around and found a similar complaint here > > > > >>> <https://blog.arkey.fr/2019/09/13/watchservice-and-bind-mount/>. > > > > >>> > > > > >>> I would be partially agreeing that we want to have a reliable > > > approach > > > > >>> for all different operating systems in general, but would be > great > > if > > > > we > > > > >>> could reach a quantitative measure of file-watch success rate if > > > > possible > > > > >>> for us to make the call. Eventually, the benefit of file-watch is > > > more > > > > >>> prompt reaction time and less configuration to the broker. > > > > >>> > > > > >>>> Gwen > > > > >>>> > > > > >>>> On Sun, Dec 6, 2020 at 8:17 AM Igor Soarez <i...@soarez.me> wrote: > > > > >>>> > > > > > >>>> > > > > > >>>> > > > The proposed change relies on a file watch, why not also > > have > > > a > > > > >>>> polling > > > > >>>> > > > interval to check the file for changes? > > > > >>>> > > > > > > > >>>> > > > The periodical check could work, the slight downside is > that > > > we > > > > >>>> need > > > > >>>> > > additional configurations to schedule the interval. Do you > > think > > > > the > > > > >>>> > > file-watch approach has any extra overhead than the interval > > > based > > > > >>>> solution? > > > > >>>> > > > > > >>>> > I don't think so. The reason I'm asking this is the KIP > > currently > > > > >>>> includes: > > > > >>>> > > > > > >>>> > "When the file watch does not work for unknown reason, user > > > could > > > > >>>> still try to change the store path in an explicit AlterConfig > call > > > in > > > > the > > > > >>>> worst case." > > > > >>>> > > > > > >>>> > Having the interval in addition to the file watch could result > > in > > > a > > > > >>>> better worst case scenario. > > > > >>>> > I understand it would require introducing at least one new > > > > >>>> configuration for the interval, so maybe this doesn't have to > > solved > > > > in > > > > >>>> this KIP. > > > > >>>> > > > > > >>>> > -- > > > > >>>> > Igor > > > > >>>> > > > > > >>>> > On Fri, Dec 4, 2020, at 5:14 PM, Boyang Chen wrote: > > > > >>>> > > Hey Igor, thanks for the feedback. > > > > >>>> > > > > > > >>>> > > On Fri, Dec 4, 2020 at 5:24 AM Igor Soarez <i...@soarez.me> > > wrote: > > > > >>>> > > > > > > >>>> > > > Hi Boyang, > > > > >>>> > > > > > > > >>>> > > > > > > >>>> > > > > > > >>>> > > > What happens if the file is changed into an invalid store? > > > Does > > > > >>>> the > > > > >>>> > > > previous store stay in use? > > > > >>>> > > > > > > > >>>> > > > If the reload fails, the previous store should be > > effective. I > > > > >>>> will state > > > > >>>> > > that in the KIP. > > > > >>>> > > > > > > >>>> > > > > > > >>>> > > > Thanks, > > > > >>>> > > > > > > > >>>> > > > -- > > > > >>>> > > > Igor > > > > >>>> > > > > > > > >>>> > > > On Fri, Dec 4, 2020, at 1:28 AM, Boyang Chen wrote: > > > > >>>> > > > > Hey there, > > > > >>>> > > > > > > > > >>>> > > > > I would like to start the discussion thread for KIP-687: > > > > >>>> > > > > > > > > >>>> > > > > > > > >>>> > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-687%3A+Automatic+Reloading+of+Security+Store > > > > >>>> > > > > > > > > >>>> > > > > This KIP is trying to deprecate the AlterConfigs API > > support > > > > of > > > > >>>> updating > > > > >>>> > > > > the security store by reloading path in-place, and > replace > > > > with > > > > >>>> a > > > > >>>> > > > > file-watch mechanism inside the broker. Let me know what > > you > > > > >>>> think. > > > > >>>> > > > > > > > > >>>> > > > > Best, > > > > >>>> > > > > Boyang > > > > >>>> > > > > > > > > >>>> > > > > > > > >>>> > > > > > > >>>> > > > > >>>> > > > > >>>> > > > > >>>> -- > > > > >>>> Gwen Shapira > > > > >>>> Engineering Manager | Confluent > > > > >>>> 650.450.2760 | @gwenshap > > > > >>>> Follow us: Twitter | blog > > > > >>>> > > > > >>> > > > > > > > > > >