Hello Skander,

I think KIP-687, while accepted, never got implemented, so this is very much an issue on the broker side as well. Perhaps this can supersede it?

I also feel we should adopt some sort of last-run-at and failure metrics, as they would be useful for operators of both brokers and clients to alert on.

Regards,
Gaurav

On 2026/03/05 10:23:23 Skander Soltane wrote:
> Hello Jakub, Gaurav,
>
> Thank you both for your feedback.
>
> On Wed, Mar 4, 2026 at 8:23 PM Gaurav Narula <[email protected]> wrote:
>
> > Hi Skander and Jakub,
> >
> > Please find my comments inline
> >
> > > On 4 Mar 2026, at 17:58, Jakub Scholz <[email protected]> wrote:
> > >
> > > Hi Skander,
> > >
> > > Thanks for the KIP. Here are some of my thoughts on it ...
> > >
> > > I think using a poller instead of the WatchService is a good choice.
> > > In the previous KIP (KIP-1119), this was my main concern about why it
> > > would not work.
> > >
> > > However, are you sure that Files.getLastModifiedTime() will work on
> > > Kubernetes with something like a mounted ConfigMap or Secret? The file
> > > itself is a symlink, and its dates do not change when a Secret is
> > > updated. At least when observed with something like bash's stat
> > > command. Only the dates of the file that the symlink points to change.
> > > So, off the top of my head, I'm not sure which timestamp Java would
> > > give you (I haven't tried it, to be honest - I'm just wondering if you
> > > did and if it really works). If the timestamp doesn't work, maybe one
> > > can just read the content of the file and store some checksum to
> > > compare it with in the next check?
> >
> > GN1: I think `Files.getLastModifiedTime()` has an overload for accepting
> > LinkOption and if none is passed it follows symlinks.
> > We should be fine as long as the timestamp for the file that the symlink
> > points to is updated.
> >
>
> I think Gaurav is correct, according to the JavaDoc of getLastModifiedTime,
> “By default, symbolic links are followed and the file attribute of the
> final target of the link is read.”
> In the Kubernetes setup I used to validate my work on the Kafka client, the
> PKCS#12 keystore and truststore are mounted via a volume, but they are
> actually generated from Vault Secret Agent (VSO) secrets exposed in another
> volume. A sidecar container is responsible for creating the stores from the
> PEM files mounted by VSO and regenerating them whenever VSO rotates the
> certificates.
> That said, you raise a valid point: if the stores were mounted directly
> from Kubernetes Secrets or ConfigMaps, would relying on getLastModifiedTime
> (which follows the final symbolic link) still be reliable? This needs to be
> validated.
> If it proves reliable in that scenario, all the better. Otherwise, I can
> switch to computing and comparing a checksum of the files instead and
> update the KIP accordingly.
>
> > > The other part of my comments in KIP-1119 was more about the usability
> > > for something like Strimzi. I do not think the debounce interval
> > > really solves the issue for us. With Kafka, you have a distributed
> > > system with:
> > > * Multiple controllers
> > > * Multiple brokers
> > > * Additional components (e.g., an Operator, Cruise Control, etc.)
> > >
> > > So when I need to, for example, roll out a new Certificate Authority,
> > > and I use mTLS authentication, I have to:
> > > * First, roll out the trust to the new CA to all the components
> > > * Only once all components trust the new CA, I can start rolling out
> > >   the new server/user certificates
> > > * Once the new user and server certificates are used by all
> > >   components, I can remove the old CA
> > >
> > > But the debounce interval works only locally within a single Kafka node.
> > > So while it allows me to safely reload the certificates within the
> > > node, which is good, it does not help me understand the state on the
> > > other nodes. To be able to orchestrate the whole system, I need a way
> > > to find out if it has been reloaded in order to proceed with the next
> > > steps. For example, open a TCP connection and sniff the actual TLS
> > > configuration. But that is pretty ugly, and leaves a mess in the logs
> > > and so on.
> > >
> > > Don't get me wrong. I think this is a useful KIP, and I guess that in
> > > many cases - especially when running things manually - it would work
> > > fine. It would also work fine for reloading server certificates only,
> > > without mTLS. Which is a useful feature as well, with CAs such as
> > > Let's Encrypt shortening the validity period of their server
> > > certificates.
> > >
> > > But for an automated solution like Strimzi, the main missing feature
> > > for the hot-reloading of certificates is not about the auto-reload
> > > being done by Kafka. It is an API that would tell us what is the
> > > current state of the system in order to orchestrate more complicated
> > > things.
> >
> > GN2: I think that's a good point and perhaps a pain shared by a few as
> > usually CAs are very long lived (of the order of years).
> > I do agree it would be useful to have an "API" to see the state of the
> > system. How about a metric for the sha256 hash of the contents of the
> > truststore?
> > Since the hash is 256 bit wide, we can split it into 4x64bit (long) chunks
> > and have 4 "tags" on the metric, one for each chunk. That way we limit the
> > cardinality of the metric to 4. What do you think?
> >
>
> Jakub, I see your point about the limitations in setups like Strimzi.
> However, as Gaurav mentioned, in most cases the CA tends to be long-lived.
> In our setup we use mTLS: client certificates are short-lived (around 100
> minutes), while server certificates have a longer lifetime. In practice, CA
> updates are relatively infrequent.
> That said, I’m not sure I fully understand why you mentioned that this
> would not work with mTLS and would only be useful for reloading server
> certificates. Also, for server certificate reloading, isn’t that already
> addressed by KIP-687 <https://cwiki.apache.org/confluence/x/lyfZCQ>?
>
> Gaurav, thank you for the suggestion, I like the idea of exposing a
> metric. Jakub, do you think it could effectively be used as an “API” to
> check the current state of the truststore?
> Regards,
> Skander
>
> > Regards,
> > Gaurav
> >
> > >
> > > Thanks & Regards
> > > Jakub
> > >
> > > On Sat, Feb 21, 2026 at 3:58 PM Skander Soltane <[email protected]>
> > > wrote:
> > >
> > >> Hi all,
> > >>
> > >> I'd like to start a discussion on a new KIP for SSL hot reload on the
> > >> client side.
> > >>
> > >> You can find the KIP here:
> > >>
> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1288%3A+SSL+Hot+Reload+for+Kafka+Clients
> > >>
> > >> I also drafted a PR implementing the KIP as I imagined it:
> > >> https://github.com/apache/kafka/pull/21488
> > >>
> > >> I'd love to hear your thoughts, especially on the polling approach vs
> > >> WatchService, the debounce mechanism, and whether the registry design
> > >> makes sense to you.
> > >>
> > >> Thank you!
> > >> Skander
> > >>
> > >
> >
>
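For reference, here is a rough sketch of the two ideas from the thread in plain JDK code: comparing a SHA-256 checksum of the store content instead of its timestamp, and splitting that 256-bit digest into four 64-bit longs. The class and method names (StoreDigest, sha256, hasChanged, toLongChunks) are made up for illustration and are not part of Kafka or of the KIP-1288 PR. How the four values would actually be exposed as a metric is left open here.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical helper for illustration only; not an existing Kafka class. */
final class StoreDigest {

    private StoreDigest() { }

    /** Returns the SHA-256 digest of the file content (symlinks are followed when reading). */
    static byte[] sha256(Path store) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return md.digest(Files.readAllBytes(store));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    /** Content-based change check, independent of file timestamps. */
    static boolean hasChanged(byte[] previousDigest, byte[] currentDigest) {
        return !Arrays.equals(previousDigest, currentDigest);
    }

    /** Splits a 32-byte digest into four 64-bit values, read in big-endian order. */
    static long[] toLongChunks(byte[] digest) {
        if (digest.length != 32) {
            throw new IllegalArgumentException("expected a 32-byte SHA-256 digest");
        }
        ByteBuffer buffer = ByteBuffer.wrap(digest); // ByteBuffer is big-endian by default
        long[] chunks = new long[4];
        for (int i = 0; i < chunks.length; i++) {
            chunks[i] = buffer.getLong();
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for a keystore or truststore.
        Path store = Files.createTempFile("truststore", ".p12");
        Files.write(store, new byte[] {1, 2, 3});
        byte[] before = sha256(store);

        Files.write(store, new byte[] {1, 2, 4}); // simulate a rotation
        byte[] after = sha256(store);

        System.out.println("changed: " + hasChanged(before, after));
        System.out.println("chunks: " + Arrays.toString(toLongChunks(after)));
        Files.delete(store);
    }
}
```

A poller could keep the last digest per store and trigger a reload only when hasChanged returns true. An orchestrator could compute the same four values from the truststore it rolled out and compare them with what each node reports.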
