Hello Skander,

I think KIP-687, while accepted, never got implemented, so this is very much an
issue on the broker side as well. Perhaps this KIP can supersede it?

I also feel we should adopt some sort of last-run-at and failure metrics,
as they would be useful for operators of both brokers and clients to
alert on.
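
To make the metrics idea concrete, here is a rough sketch of the two values I have in mind. The class and method names are hypothetical and not part of the KIP or of Kafka; wiring them into the actual metrics registry is left out on purpose.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical holder for reload health values; not part of the KIP or of Kafka. */
final class ReloadStats {
    // Epoch millis of the most recent reload attempt, or -1 if none has run yet.
    private final AtomicLong lastRunAtMs = new AtomicLong(-1L);
    // Total number of reload attempts that ended with an exception.
    private final AtomicLong failureTotal = new AtomicLong();

    /** Runs one reload attempt and records its timestamp and outcome. */
    void record(long nowMs, Runnable reload) {
        lastRunAtMs.set(nowMs);
        try {
            reload.run();
        } catch (RuntimeException e) {
            failureTotal.incrementAndGet();
        }
    }

    long lastRunAtMs() { return lastRunAtMs.get(); }

    long failureTotal() { return failureTotal.get(); }
}
```

An operator could then alert when the last-run timestamp stops advancing or when the failure total increases.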

Regards,
Gaurav

On 2026/03/05 10:23:23 Skander Soltane wrote:
> Hello Jakub, Gaurav,
> 
> Thank you both for your feedback.
> 
> On Wed, Mar 4, 2026 at 8:23 PM Gaurav Narula <[email protected]> wrote:
> 
> > Hi Skander and Jakub,
> >
> > Please find my comments inline
> >
> > > On 4 Mar 2026, at 17:58, Jakub Scholz <[email protected]> wrote:
> > >
> > > Hi Skander,
> > >
> > > Thanks for the KIP. Here are some of my thoughts on it ...
> > >
> > > I think using a poller instead of the WatchService is a good choice. In the
> > > previous KIP (KIP-1119), this was my main concern about why it would not
> > > work.
> > >
> > > However, are you sure that Files.getLastModifiedTime() will work on
> > > Kubernetes with something like a mounted ConfigMap or Secret? The file
> > > itself is a symlink, and its dates do not change when a Secret is updated,
> > > at least when observed with something like the stat command in a shell.
> > > Only the dates of the file that the symlink points to change. So, off the
> > > top of my head, I'm not sure which timestamp Java would give you (I haven't
> > > tried it, to be honest - I'm just wondering if you did and if it really
> > > works). If the timestamp doesn't work, maybe one can just read the content
> > > of the file and store a checksum to compare against in the next check?
> >
> > GN1: I think `Files.getLastModifiedTime()` accepts optional LinkOption
> > arguments and, if none is passed, it follows symlinks.
> > We should be fine as long as the timestamp of the file that the symlink
> > points to is updated.
> >
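
For reference, a minimal sketch of the two ways to read the timestamp. The class and method names are hypothetical. It only illustrates the documented symlink behaviour of `Files.getLastModifiedTime` and does not by itself validate the Kubernetes Secret or ConfigMap case.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

/** Hypothetical helper; the names are illustrative only. */
final class SymlinkMtime {
    /** Default behaviour: symlinks are followed and the final target's timestamp is read. */
    static FileTime targetTime(Path p) throws IOException {
        return Files.getLastModifiedTime(p);
    }

    /** With NOFOLLOW_LINKS the timestamp of the symlink entry itself is read. */
    static FileTime linkTime(Path p) throws IOException {
        return Files.getLastModifiedTime(p, LinkOption.NOFOLLOW_LINKS);
    }
}
```

A poller that compares `targetTime` between runs would notice a change only if the final target's timestamp moves.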
> 
> I think Gaurav is correct, according to the JavaDoc of getLastModifiedTime,
> “By default, symbolic links are followed and the file attribute of the
> final target of the link is read.”
> In the Kubernetes setup I used to validate my work on the Kafka client, the
> PKCS#12 keystore and truststore are mounted via a volume, but they are
> actually generated from Vault Secrets Operator (VSO) secrets exposed in another
> volume. A sidecar container is responsible for creating the stores from the
> PEM files mounted by VSO and regenerating them whenever VSO rotates the
> certificates.
> That said, you raise a valid point: if the stores were mounted directly
> from Kubernetes Secrets or ConfigMaps, would relying on getLastModifiedTime
> (which follows the final symbolic link) still be reliable? This needs to be
> validated.
> If it proves reliable in that scenario, all the better. Otherwise, I can
> switch to computing and comparing a checksum of the files instead and
> update the KIP accordingly.
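
A rough sketch of what the checksum fallback could look like, assuming SHA-256 over the whole file content. The class name and method names are hypothetical and are not taken from the KIP or the draft PR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical content-based change detector; not taken from the KIP or the PR. */
final class ChecksumChangeDetector {
    // Digest seen on the previous check; null until the first check has run.
    private byte[] lastDigest;

    /** Returns true if the content differs from the previous call (the first call returns true). */
    boolean hasChanged(Path file) throws IOException {
        byte[] digest = sha256(Files.readAllBytes(file));
        boolean changed = !Arrays.equals(digest, lastDigest);
        lastDigest = digest;
        return changed;
    }

    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

The trade-off is that every poll reads the full file, which should be cheap for keystores and truststores of typical size.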
> 
> >
> > > The other part of my comments in KIP-1119 was more about the usability for
> > > something like Strimzi. I do not think the debounce interval really solves
> > > the issue for us. With Kafka, you have a distributed system with:
> > > * Multiple controllers
> > > * Multiple brokers
> > > * Additional components (e.g., an Operator, Cruise Control, etc.)
> > >
> > > So when I need to, for example, roll out a new Certificate Authority, and I
> > > use mTLS authentication, I have to:
> > > * First, roll out the trust to the new CA to all the components
> > > * Only once all components trust the new CA, I can start rolling out the
> > > new server/user certificates
> > > * Once the new user and server certificates are used by all components, I
> > > can remove the old CA
> > >
> > > But the debounce interval works only locally within a single Kafka node. So
> > > while it allows me to safely reload the certificates within the node, which
> > > is good, it does not help me understand the state of the other nodes. To be
> > > able to orchestrate the whole system, I need a way to find out whether each
> > > node has reloaded in order to proceed with the next steps. For example, I
> > > could open a TCP connection and sniff the actual TLS configuration. But
> > > that is pretty ugly, and it leaves a mess in the logs and so on.
> > >
> > > Don't get me wrong. I think this is a useful KIP, and I guess that in many
> > > cases - especially when running things manually - it would work fine. It
> > > would also work fine for reloading server certificates only, without
> > > mTLS, which is a useful feature as well, with CAs such as Let's Encrypt
> > > shortening the validity period of their server certificates.
> > >
> > > But for an automated solution like Strimzi, the main missing feature for
> > > the hot-reloading of certificates is not the auto-reload being done by
> > > Kafka. It is an API that would tell us the current state of the system so
> > > that we can orchestrate more complicated things.
> >
> > GN2: I think that's a good point, and perhaps a pain shared by a few, as
> > CAs are usually very long-lived (on the order of years).
> > I do agree it would be useful to have an "API" to see the state of the
> > system. How about a metric for the SHA-256 hash of the contents of the
> > truststore?
> > Since the hash is 256 bits wide, we can split it into 4 x 64-bit (long)
> > chunks and have 4 "tags" on the metric, one for each chunk. That way we
> > limit the cardinality of the metric to 4. What do you think?
> >
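
A minimal sketch of the split described in GN2, assuming the raw truststore bytes are hashed and the digest is read as four big-endian longs. The class and method names are hypothetical, and how the chunks would be attached to a metric is not shown.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper; the names are illustrative only. */
final class TruststoreFingerprint {
    /** Splits the SHA-256 of the given bytes into four 64-bit chunks, in digest order. */
    static long[] chunks(byte[] truststoreBytes) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(truststoreBytes);
            ByteBuffer buf = ByteBuffer.wrap(digest); // ByteBuffer is big-endian by default
            return new long[] {buf.getLong(), buf.getLong(), buf.getLong(), buf.getLong()};
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

Two nodes whose four chunks all match would then be serving the same truststore content, up to hash collisions.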
>
> Jakub, I see your point about the limitations in setups like Strimzi.
> However, as Gaurav mentioned, in most cases the CA tends to be long-lived.
> In our setup we use mTLS: client certificates are short-lived (around 100
> minutes), while server certificates have a longer lifetime. In practice, CA
> updates are relatively infrequent.
> That said, I’m not sure I fully understand why you mentioned that this
> would not work with mTLS and would only be useful for reloading server
> certificates. Also, for server certificate reloading, isn’t that already
> addressed by KIP-687 <https://cwiki.apache.org/confluence/x/lyfZCQ>?
> 
> Gaurav, thank you for the suggestion. I like the idea of exposing a
> metric. Jakub, do you think it could effectively be used as an “API” to
> check the current state of the truststore?
> Regards,
> Skander
> 
> 
> > Regards,
> > Gaurav
> >
> > >
> > > Thanks & Regards
> > > Jakub
> > >
> > > On Sat, Feb 21, 2026 at 3:58 PM Skander Soltane <[email protected]> wrote:
> > >
> > >> Hi all,
> > >>
> > >> I'd like to start a discussion on a new KIP for SSL hot reload on the
> > >> client side.
> > >>
> > >> You can find the KIP here :
> > >>
> > >>
> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1288%3A+SSL+Hot+Reload+for+Kafka+Clients
> > >>
> > >> I also drafted a PR implementing the KIP as I imagined it:
> > >> https://github.com/apache/kafka/pull/21488
> > >>
> > >> I'd love to hear your thoughts, especially on the polling approach vs
> > >> WatchService, the debounce mechanism, and whether the registry design
> > >> makes sense to you.
> > >>
> > >> Thank you!
> > >> Skander
> > >>
> >
> >
>
