Re: [DISCUSS] FLIP-523 - Handle TLS Certificate Renewal

Nicolas Fraison Thu, 05 Jun 2025 05:27:29 -0700

Hi,

It indeed make sense, FLIP has been updated


Nicolas

On Wed, Jun 4, 2025 at 12:10 PM Gabor Somogyi <[email protected]>
wrote:

> Hi,
>
> I've read it through. Basically looks good with one comment.
> *registerWatchedPath function* has a *Callable* as callback. For the
> current implementation that would be enough,
> but when somebody would like to use that for any other use-case then it
> would be hard. Examples:
> * A single *call* function tells the user nothing what kind of event is
> this
> * the watch service supports 3 events (create, modify, delete), now when I
> register I can get only updates (I presume)
> * 1+ dir watch with the same callback is not possible
>
> My suggestion would be to create a proper callback with all the event type
> functions and no-op default behavior with the following names[1].
> This is ~30 lines addition but will increase the readability heavily + no
> need to touch this code later.
>
> BR,
> G
>
> [1]
>
> https://github.com/apache/flink-kubernetes-operator/commit/e5a325c48965a50d61d0aa29e61ba79e97f27082#diff-a30b3ed9b8c53e998b15d7da7ad2e54374c98ffc3c920f76a70bce3fb37a9b2eR87-R93
>
>
> On Tue, Jun 3, 2025 at 8:53 AM Nicolas Fraison
> <[email protected]> wrote:
>
> > Hi,
> >
> > The FLIP has been updated.
> > Let me know if you have some other comments.
> >
> > Nicolas
> >
> > On Tue, May 20, 2025 at 12:01 PM Nicolas Fraison <
> > [email protected]> wrote:
> >
> > > Hi Gabor,
> > >
> > > Thanks for the feedback.
> > > The overall proposal with atomic dirty flag being set by the callback
> > will
> > > indeed work with any kind of implementation.
> > >
> > >  Will see to update the FLIP in a week or 2 if there are no other
> > comments
> > > on it
> > >
> > > Nicolas
> > >
> > >
> > > On Tue, May 20, 2025 at 11:08 AM Gabor Somogyi <
> > [email protected]>
> > > wrote:
> > >
> > >> Hi Nicolas,
> > >>
> > >> Related SSLContext I've not gone through all the cases where we need
> to
> > >> reload so instead I'm sharing the concept.
> > >> The main intention is that we must control all execution paths which
> > >> decide
> > >> which certificate used for authentication.
> > >> Creating an SSLContext decorator which checks reload first and then
> > >> forwards all calls to the original (wrapped) context
> > >> is one way to achieve that. If there are different implementations
> which
> > >> end up in similar behavior then it's fine.
> > >>
> > >> BR,
> > >> G
> > >>
> > >>
> > >> On Tue, May 20, 2025 at 10:15 AM Nicolas Fraison
> > >> <[email protected]> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > Overall your proposal looks great.
> > >> > The event handling must indeed be super fast and we must also not
> > change
> > >> > original code path if reload not needed
> > >> >
> > >> > I have some concerns around the ReloadableSSLContext implementing
> all
> > >> > SSLContext
> > >> > Do you really mean SSLContext (java SSLContext one) or do you refer
> to
> > >> the
> > >> > SslContext from netty?
> > >> >
> > >> > If java SSLContext
> > >> > - I'm not sure how this will manage reload from the BlobServer
> > >> > BlobServer relies on creation of an SSLServerSocketFactory from the
> > >> > SSLContext.
> > >> > But from my current understanding the SSLServerSocketFactory does
> not
> > >> have
> > >> > any connection with the SSLContext.
> > >> > It only has some with an SSLContextImpl extends SSLContextSpi.
> > >> > I think we will need a callback here to enforce recreation of the
> > >> > BlobServer socket.
> > >> > - I'm also don't see how to attach this to the SslContext from netty
> > >> >
> > >> > If netty SslContext
> > >> > - we would still need to have the callback to recreate the
> BlobServer
> > >> > socket
> > >> > - for netty and pekko we should be able to rely on it
> > >> >
> > >> > Nicolas
> > >> >
> > >> > On Wed, May 7, 2025 at 12:40 PM Gabor Somogyi <
> > >> [email protected]>
> > >> > wrote:
> > >> >
> > >> > > Hi All,
> > >> > >
> > >> > > I've read through the concerns/proposals related the watch service
> > and
> > >> > here
> > >> > > are my conclusions:
> > >> > > * Watch service can watch local file systems only: It's fair to
> say
> > >> that
> > >> > > certificates must be copied to local FS in order to work (init
> > >> container
> > >> > cp
> > >> > > command or something)
> > >> > > * Watch service can send multiple events even for a single
> directory
> > >> > > change: this can be mitigated with a single atomic dirty flag (see
> > my
> > >> > > design suggestion)
> > >> > > * Polling file modification time: When I hear any kind of polling
> > >> > > implemented by us is just something I'm mostly opposing
> > >> > > * security.ssl.internal.keystore.reload.duration: Security
> features
> > >> must
> > >> > be
> > >> > > executed nearly immediately no matter what
> > >> > >
> > >> > > As a general remark, not sure how many users are using the watch
> > >> service
> > >> > > based approach in the operator but until now I've not seen any
> issue
> > >> > > related to that.
> > >> > > If somebody is having some specifics then please share.
> > >> > >
> > >> > > For now I would shoot for keystore not to have feature creep. When
> > >> there
> > >> > is
> > >> > > a valid use-case, community interest and the keystore story is
> > already
> > >> > rock
> > >> > > stable
> > >> > > then we can consider to involve truststore later.
> > >> > >
> > >> > > Having the mentioned assumption that the operator approach works,
> > >> here is
> > >> > > my high level proposal:
> > >> > > * Let's have enable flag for all such watch functionality. If it's
> > >> false
> > >> > > then the currently existing functionality must remain as-is
> > >> > > * Let's have a LocalFSWatchService which is a singleton which has
> no
> > >> path
> > >> > > registrations by default
> > >> > > * Add path registration functionality which is synchronised where
> a
> > >> > > callback can be registered
> > >> > > * Let's have a ReloadableSSLContext which implements the mentioned
> > >> > callback
> > >> > > * Inside the callback set an atomic dirty flag only (this can
> handle
> > >> > > multiple events for the same directory change + event handling in
> > the
> > >> > watch
> > >> > > service must be extreme fast)
> > >> > > * Inside ReloadableSSLContext all SSLContext actions must be
> > >> overridden.
> > >> > At
> > >> > > the beginning of each function dirty flag must be checked
> > >> > > and if dirty then certificates must be reloaded, flag can be set
> > back
> > >> to
> > >> > > false (then original functionality call). It's extremely important
> > >> that
> > >> > > context reload must be synchronised.
> > >> > > A synchronised boolean check + possible context reload can consume
> > >> some
> > >> > > time but I wouldn't expect any significant performance drop.
> > >> > >
> > >> > > My proposal main drivers:
> > >> > > * Original code path must run when no watch service asked
> > >> > > * Super fast event handling because million events may come in
> (not
> > >> > > expecting but we should be prepared)
> > >> > > * Clean separation between dir/file watch and file/dir usage
> > >> > > * Well considered synchronisation model
> > >> > > * Extensive unit testing because we're intended to touch the heart
> > of
> > >> > Flink
> > >> > >
> > >> > > Happy to hear other opinions.
> > >> > >
> > >> > > BR,
> > >> > > G
> > >> > >
> > >> > >
> > >> > > On Mon, Apr 28, 2025 at 1:54 PM Nicolas Fraison
> > >> > > <[email protected]> wrote:
> > >> > >
> > >> > > > Thanks all for your feedback and sorry for the late answer (I
> was
> > on
> > >> > > > holiday).
> > >> > > >
> > >> > > > 1. Indeed it would add 4 threads.
> > >> > > > For the non pekko component we can indeed have one watcher
> service
> > >> used
> > >> > > to
> > >> > > > reload SSLContext for those components
> > >> > > > For pekko this is a little more challenging as the creation of
> the
> > >> > pekko
> > >> > > > ssl engine is managed by pekko himself.
> > >> > > > Flink only generates appropriate config with class to execute to
> > >> > initiate
> > >> > > > the pekko ssl engine [1]. This means that I will not be able to
> > >> provide
> > >> > > the
> > >> > > > watcher service to this ssl engine.
> > >> > > > One solution would be to rely on a singleton instead of a
> service
> > >> > > injected
> > >> > > > in each component but I'm not sure this is fine to use such in
> > >> flink.
> > >> > > > WDYT?
> > >> > > >
> > >> > > > We can also add a specific flink configuration
> > >> > > > (security.ssl.internal.keystore.reload.enable) to only add this
> > >> watcher
> > >> > > > mechanism if the config is enabled to avoid adding those threads
> > if
> > >> > this
> > >> > > is
> > >> > > > not needed.
> > >> > > >
> > >> > > >
> > >> > > > 2. I'm fine with the LocalFSWatchService naming.
> > >> > > >
> > >> > > > 3. Also agree that some e2e tests must be added.
> > >> > > >
> > >> > > >
> > >> > > > @doguscan namal, Thanks for challenging the proposal.
> > >> > > > FYI, we are planning to rely on certificates with really short
> > >> > validity.
> > >> > > >
> > >> > > > D1. It seems that the proposal to rely on a reload period will
> > still
> > >> > face
> > >> > > > potential issues:
> > >> > > > - receiving events while file content updates are still in
> > progress:
> > >> > > > There is no guarantee that we will not load the certificate
> while
> > >> file
> > >> > > > content updates are still in progress
> > >> > > > - while it will not be affected by multiple notifications, we
> can
> > >> > reach a
> > >> > > > point where only the truststore is updated when the reload
> > happens.
> > >> > > > Which means that if the keystore is updated just after, it will
> > not
> > >> be
> > >> > > > taken in account before next run of the reload mechanism
> > >> > > >
> > >> > > > I think that with WatchService and appropriate reload grace
> period
> > >> > > > mechanism we should be able to mitigate those 2 issues (ensuring
> > >> > minimum
> > >> > > > reload even with multiple notify)
> > >> > > >
> > >> > > > From KIP-1119 [2] it looks like the same kind of requirements is
> > >> under
> > >> > > > discussion for Kafka to also rely on the WatchService Java API
> > >> > > (SpringBoot
> > >> > > > seems to also rely on this API to manage ssl reload [3]).
> > >> > > >
> > >> > > > D3. Do we have a real use case for this?
> > >> > > >
> > >> > > > [1]
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/apache/flink/blob/b523264ab45d37cd9584a0e8c06f1ef6bd1aaed7/flink-rpc/flink-rpc-akka/src/main/java/org/apache/flink/runtime/rpc/pekko/PekkoUtils.java#L372
> > >> > > >
> > >> > > > [2]
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1119%3A+Add+support+for+SSL+auto+reload
> > >> > > >
> > >> > > > [3]
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://docs.spring.io/spring-boot/reference/features/ssl.html#features.ssl.reloading
> > >> > > >
> > >> > > > Regards,
> > >> > > > Nicolas
> > >> > > >
> > >> > > > On Wed, Apr 23, 2025 at 12:14 PM Doğuşcan Namal <
> > >> > > [email protected]>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > Hi Nicolas, thanks for the FLIP.
> > >> > > > >
> > >> > > > > I am fully supportive of the motivation and we should be
> > >> supporting
> > >> > > this
> > >> > > > > feature. Here are couple of comments from my side:
> > >> > > > >
> > >> > > > > D1)
> > >> > > > > Since you shared the implementation details on the FLIP as
> > well, I
> > >> > > would
> > >> > > > > like to discuss whether using Java's WatchService is the best
> > >> choice
> > >> > > > here.
> > >> > > > >
> > >> > > > > I believe that the certificate renewal is not a frequent
> > >> operation.
> > >> > > Even
> > >> > > > > once a day certificate renewals are not realistic(except for
> > test
> > >> > cases
> > >> > > > > maybe) but let's assume that this covers up the p99.99 of the
> > use
> > >> > > cases.
> > >> > > > I
> > >> > > > > am confident on this estimation since there hasn't been a
> > request
> > >> > from
> > >> > > > the
> > >> > > > > community for this feature so far, which confirms that people
> > were
> > >> > okay
> > >> > > > > with infrequent cluster restarts. Following that it is
> > >> infrequent, I
> > >> > > > > believe that spawning up a thread that watches the file
> > >> modification
> > >> > > > > operations is not the best use of the limited resources on a
> > >> cluster.
> > >> > > > >
> > >> > > > > There are some known limitations of the WatchService as well
> > such
> > >> as
> > >> > > > > receiving multiple modification events for the same occurence
> > [1],
> > >> > > > > inotify's (WatchService's underlying mechanism in Linux
> > >> environments)
> > >> > > > > problems on containerized environments due to remote file
> > systems
> > >> [2]
> > >> > > or
> > >> > > > > receiving events while file content updates are still in
> > progress.
> > >> > > [3]. I
> > >> > > > > do not know if these limitations are addressed in the newer
> > >> versions
> > >> > > but
> > >> > > > > regardless of that it is clear that we may face with some ugly
> > >> edge
> > >> > > cases
> > >> > > > > due to that.
> > >> > > > >
> > >> > > > > Given these complications, I would recommend just creating a
> new
> > >> > > > SSLContext
> > >> > > > > after a configured duration is expired. We could record the
> > >> timestamp
> > >> > > > when
> > >> > > > > the previous SSLContext is created and update it after a
> > >> configured
> > >> > > > > duration is passed. This will be much easier to test and
> reason
> > >> about
> > >> > > > when
> > >> > > > > it is running on production. This will eliminate the necessity
> > to
> > >> > > reason
> > >> > > > > about the file modification operations as well.
> > >> > > > >
> > >> > > > > I briefly skimmed through the classes that need to be modified
> > >> and it
> > >> > > > > looked feasible for me. Let me know what are your comments on
> > >> these.
> > >> > > > >
> > >> > > > > Note that this is already used in the Kafka world where a new
> > >> > > SSLContext
> > >> > > > is
> > >> > > > > created after 12 hours.
> > >> > > > >
> > >> > > > > D2) We could provide a configuration to the user, such as
> > >> > > > > "security.ssl.internal.keystore.reload.duration" so they could
> > >> decide
> > >> > > how
> > >> > > > > often the new certificates should be loaded.
> > >> > > > >
> > >> > > > > D3)
> > >> > > > > On the other hand, I wonder whether we should also handle
> > >> supporting
> > >> > > > > updating the file paths of the truststores and keystores under
> > >> this
> > >> > > FLIP
> > >> > > > as
> > >> > > > > well. Since the name of the FLIP is "Handle TLS Certificate
> > >> Renewal"
> > >> > I
> > >> > > > > think we could bring that into scope too :)
> > >> > > > >
> > >> > > > > Thanks
> > >> > > > >
> > >> > > > > [1]
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://stackoverflow.com/questions/16777869/java-7-watchservice-ignoring-multiple-occurrences-of-the-same-event
> > >> > > > >
> > >> > > > > [2]
> > https://blog.arkey.fr/2019/09/13/watchservice-and-bind-mount/
> > >> > > > >
> > >> > > > > [3]
> > >> > > >
> > >> https://surajatreyac.github.io/2014-07-29/reactive_file_handling.html
> > >> > > > >
> > >> > > > > [4] See also Platform Dependencies -
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://docs.oracle.com/javase/8/docs/api/index.html?java/nio/file/WatchService.html
> > >> > > > >
> > >> > > > >
> > >> > > > > On Fri, 18 Apr 2025 at 18:25, Gabor Somogyi <
> > >> > [email protected]
> > >> > > >
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Hi Robert,
> > >> > > > > >
> > >> > > > > > Since I've added the same feature to the operator I'll take
> a
> > >> look
> > >> > at
> > >> > > > it.
> > >> > > > > > Though it won't be lightning fast since I'm having several
> > weeks
> > >> > off.
> > >> > > > > >
> > >> > > > > > Your questions are valid especially considering the fact
> that
> > >> this
> > >> > > > > feature
> > >> > > > > > touches the hearth of the authentication so this must be
> rock
> > >> solid
> > >> > > > > > in order to avoid grey hair :)
> > >> > > > > >
> > >> > > > > > (1) I would vote on a single service which is heavily unit
> > >> tested
> > >> > > with
> > >> > > > > > all the possible combinations including threading.
> > >> > > > > >
> > >> > > > > > Some standalone app could be added to really play with it
> > (that
> > >> > would
> > >> > > > > help
> > >> > > > > > review).
> > >> > > > > > I mean, create X files, start Y threads, and make
> assertions.
> > >> > > > > > The reason why I'm suggesting it is the fact that AFAIR the
> > >> watch
> > >> > > > service
> > >> > > > > > is quite sensitive even in single thread. If we could do
> this
> > >> in a
> > >> > > > finite
> > >> > > > > > time
> > >> > > > > > consuming unit test then it's even better.
> > >> > > > > >
> > >> > > > > > (2) +1 on that name to avoid confusion
> > >> > > > > >
> > >> > > > > > (3) I agree that some e2e is must, however this can be
> easily
> > >> and
> > >> > > > deeply
> > >> > > > > > unit
> > >> > > > > > tested so that part is also essential. One key test here is
> > when
> > >> > > > > > certificates
> > >> > > > > > are not changing then no action must be performed (not to
> > break
> > >> the
> > >> > > > whole
> > >> > > > > > system apart).
> > >> > > > > >
> > >> > > > > > Purely personal opinion but such feature developments are
> slow
> > >> by
> > >> > > > nature
> > >> > > > > > because
> > >> > > > > > of edge case / stress testing.
> > >> > > > > >
> > >> > > > > > BR,
> > >> > > > > > G
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Fri, Apr 18, 2025 at 4:53 PM Robert Metzger <
> > >> > [email protected]>
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > Hi Nicolas,
> > >> > > > > > >
> > >> > > > > > > This looks like a nice improvement, thanks for the write
> up.
> > >> > > > > > > Are you in touch with any committer who's willing to
> review
> > /
> > >> > merge
> > >> > > > > this?
> > >> > > > > > >
> > >> > > > > > > Some random questions on the FLIP:
> > >> > > > > > > (1)  "Each service that depends on TLS certificates will
> > >> > > initialize a
> > >> > > > > > > FileSytemWatchService"
> > >> > > > > > >
> > >> > > > > > > It seems that there are 4 components using SSL, does this
> > mean
> > >> > > there
> > >> > > > > will
> > >> > > > > > > be 4 additional threads running, watching the same set of
> > >> files?
> > >> > > > > > > Wouldn't it be better to introduce a central file watching
> > >> > service,
> > >> > > > and
> > >> > > > > > SSL
> > >> > > > > > > users can subscribe to updates, to reduce the number of
> > >> threads?
> > >> > > > > > > If this makes the whole effort 4x more complicated, I
> > wouldn't
> > >> > > > consider
> > >> > > > > > it,
> > >> > > > > > > but if its roughly the same effort, we should :)
> > >> > > > > > >
> > >> > > > > > > (2) "FileSytemWatchService"
> > >> > > > > > > When I read this name, I was wondering, whether this is
> > >> somehow
> > >> > > > related
> > >> > > > > > to
> > >> > > > > > > the Flink "FileSystem" classes. Which I think its' not.
> > >> > > > > > > Maybe a different name, that makes this separation more
> > >> explicit,
> > >> > > > would
> > >> > > > > > > make sense. Maybe "LocalFSWatchService"?
> > >> > > > > > > (I'm sorry to bring up naming stuff -- its very
> subjective,
> > >> and
> > >> > > > > > difficult)
> > >> > > > > > >
> > >> > > > > > > (3) For the test plan: There seem to be some SSL related
> e2e
> > >> > tests:
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/common_ssl.sh
> > >> > > > > > > It would be nice to extend them to cover this feature as
> > >> well. I
> > >> > > > would
> > >> > > > > > hate
> > >> > > > > > > for this feature to slowly break by future changes, so
> good
> > >> e2e
> > >> > > test
> > >> > > > > > > coverage is key, in particular bc so many components are
> > >> > involved.
> > >> > > > > > >
> > >> > > > > > > Best,
> > >> > > > > > > Robert
> > >> > > > > > >
> > >> > > > > > > On Wed, Apr 16, 2025 at 11:55 AM Nicolas Fraison
> > >> > > > > > > <[email protected]> wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi All,
> > >> > > > > > > >
> > >> > > > > > > > I'd like to start a discussion to Handle TLS Certificate
> > >> > Renewal
> > >> > > > > > > > Please provide some feedback on this proposal:
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-523%3A+Handle+TLS+Certificate+Renewal
> > >> > > > > > > >
> > >> > > > > > > > Regards,
> > >> > > > > > > >
> > >> > > > > > > > Nicolas Fraison
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: [DISCUSS] FLIP-523 - Handle TLS Certificate Renewal

Reply via email to