Re: [DISCUSS] FLIP-523 - Handle TLS Certificate Renewal

Nicolas Fraison Mon, 02 Jun 2025 23:53:37 -0700

Hi,

The FLIP has been updated.
Let me know if you have some other comments.


Nicolas

On Tue, May 20, 2025 at 12:01 PM Nicolas Fraison <
nicolas.frai...@datadoghq.com> wrote:

> Hi Gabor,
>
> Thanks for the feedback.
> The overall proposal with atomic dirty flag being set by the callback will
> indeed work with any kind of implementation.
>
>  Will see to update the FLIP in a week or 2 if there are no other comments
> on it
>
> Nicolas
>
>
> On Tue, May 20, 2025 at 11:08 AM Gabor Somogyi <gabor.g.somo...@gmail.com>
> wrote:
>
>> Hi Nicolas,
>>
>> Related SSLContext I've not gone through all the cases where we need to
>> reload so instead I'm sharing the concept.
>> The main intention is that we must control all execution paths which
>> decide
>> which certificate used for authentication.
>> Creating an SSLContext decorator which checks reload first and then
>> forwards all calls to the original (wrapped) context
>> is one way to achieve that. If there are different implementations which
>> end up in similar behavior then it's fine.
>>
>> BR,
>> G
>>
>>
>> On Tue, May 20, 2025 at 10:15 AM Nicolas Fraison
>> <nicolas.frai...@datadoghq.com.invalid> wrote:
>>
>> > Hi,
>> >
>> > Overall your proposal looks great.
>> > The event handling must indeed be super fast and we must also not change
>> > original code path if reload not needed
>> >
>> > I have some concerns around the ReloadableSSLContext implementing all
>> > SSLContext
>> > Do you really mean SSLContext (java SSLContext one) or do you refer to
>> the
>> > SslContext from netty?
>> >
>> > If java SSLContext
>> > - I'm not sure how this will manage reload from the BlobServer
>> > BlobServer relies on creation of an SSLServerSocketFactory from the
>> > SSLContext.
>> > But from my current understanding the SSLServerSocketFactory does not
>> have
>> > any connection with the SSLContext.
>> > It only has some with an SSLContextImpl extends SSLContextSpi.
>> > I think we will need a callback here to enforce recreation of the
>> > BlobServer socket.
>> > - I'm also don't see how to attach this to the SslContext from netty
>> >
>> > If netty SslContext
>> > - we would still need to have the callback to recreate the BlobServer
>> > socket
>> > - for netty and pekko we should be able to rely on it
>> >
>> > Nicolas
>> >
>> > On Wed, May 7, 2025 at 12:40 PM Gabor Somogyi <
>> gabor.g.somo...@gmail.com>
>> > wrote:
>> >
>> > > Hi All,
>> > >
>> > > I've read through the concerns/proposals related the watch service and
>> > here
>> > > are my conclusions:
>> > > * Watch service can watch local file systems only: It's fair to say
>> that
>> > > certificates must be copied to local FS in order to work (init
>> container
>> > cp
>> > > command or something)
>> > > * Watch service can send multiple events even for a single directory
>> > > change: this can be mitigated with a single atomic dirty flag (see my
>> > > design suggestion)
>> > > * Polling file modification time: When I hear any kind of polling
>> > > implemented by us is just something I'm mostly opposing
>> > > * security.ssl.internal.keystore.reload.duration: Security features
>> must
>> > be
>> > > executed nearly immediately no matter what
>> > >
>> > > As a general remark, not sure how many users are using the watch
>> service
>> > > based approach in the operator but until now I've not seen any issue
>> > > related to that.
>> > > If somebody is having some specifics then please share.
>> > >
>> > > For now I would shoot for keystore not to have feature creep. When
>> there
>> > is
>> > > a valid use-case, community interest and the keystore story is already
>> > rock
>> > > stable
>> > > then we can consider to involve truststore later.
>> > >
>> > > Having the mentioned assumption that the operator approach works,
>> here is
>> > > my high level proposal:
>> > > * Let's have enable flag for all such watch functionality. If it's
>> false
>> > > then the currently existing functionality must remain as-is
>> > > * Let's have a LocalFSWatchService which is a singleton which has no
>> path
>> > > registrations by default
>> > > * Add path registration functionality which is synchronised where a
>> > > callback can be registered
>> > > * Let's have a ReloadableSSLContext which implements the mentioned
>> > callback
>> > > * Inside the callback set an atomic dirty flag only (this can handle
>> > > multiple events for the same directory change + event handling in the
>> > watch
>> > > service must be extreme fast)
>> > > * Inside ReloadableSSLContext all SSLContext actions must be
>> overridden.
>> > At
>> > > the beginning of each function dirty flag must be checked
>> > > and if dirty then certificates must be reloaded, flag can be set back
>> to
>> > > false (then original functionality call). It's extremely important
>> that
>> > > context reload must be synchronised.
>> > > A synchronised boolean check + possible context reload can consume
>> some
>> > > time but I wouldn't expect any significant performance drop.
>> > >
>> > > My proposal main drivers:
>> > > * Original code path must run when no watch service asked
>> > > * Super fast event handling because million events may come in (not
>> > > expecting but we should be prepared)
>> > > * Clean separation between dir/file watch and file/dir usage
>> > > * Well considered synchronisation model
>> > > * Extensive unit testing because we're intended to touch the heart of
>> > Flink
>> > >
>> > > Happy to hear other opinions.
>> > >
>> > > BR,
>> > > G
>> > >
>> > >
>> > > On Mon, Apr 28, 2025 at 1:54 PM Nicolas Fraison
>> > > <nicolas.frai...@datadoghq.com.invalid> wrote:
>> > >
>> > > > Thanks all for your feedback and sorry for the late answer (I was on
>> > > > holiday).
>> > > >
>> > > > 1. Indeed it would add 4 threads.
>> > > > For the non pekko component we can indeed have one watcher service
>> used
>> > > to
>> > > > reload SSLContext for those components
>> > > > For pekko this is a little more challenging as the creation of the
>> > pekko
>> > > > ssl engine is managed by pekko himself.
>> > > > Flink only generates appropriate config with class to execute to
>> > initiate
>> > > > the pekko ssl engine [1]. This means that I will not be able to
>> provide
>> > > the
>> > > > watcher service to this ssl engine.
>> > > > One solution would be to rely on a singleton instead of a service
>> > > injected
>> > > > in each component but I'm not sure this is fine to use such in
>> flink.
>> > > > WDYT?
>> > > >
>> > > > We can also add a specific flink configuration
>> > > > (security.ssl.internal.keystore.reload.enable) to only add this
>> watcher
>> > > > mechanism if the config is enabled to avoid adding those threads if
>> > this
>> > > is
>> > > > not needed.
>> > > >
>> > > >
>> > > > 2. I'm fine with the LocalFSWatchService naming.
>> > > >
>> > > > 3. Also agree that some e2e tests must be added.
>> > > >
>> > > >
>> > > > @doguscan namal, Thanks for challenging the proposal.
>> > > > FYI, we are planning to rely on certificates with really short
>> > validity.
>> > > >
>> > > > D1. It seems that the proposal to rely on a reload period will still
>> > face
>> > > > potential issues:
>> > > > - receiving events while file content updates are still in progress:
>> > > > There is no guarantee that we will not load the certificate while
>> file
>> > > > content updates are still in progress
>> > > > - while it will not be affected by multiple notifications, we can
>> > reach a
>> > > > point where only the truststore is updated when the reload happens.
>> > > > Which means that if the keystore is updated just after, it will not
>> be
>> > > > taken in account before next run of the reload mechanism
>> > > >
>> > > > I think that with WatchService and appropriate reload grace period
>> > > > mechanism we should be able to mitigate those 2 issues (ensuring
>> > minimum
>> > > > reload even with multiple notify)
>> > > >
>> > > > From KIP-1119 [2] it looks like the same kind of requirements is
>> under
>> > > > discussion for Kafka to also rely on the WatchService Java API
>> > > (SpringBoot
>> > > > seems to also rely on this API to manage ssl reload [3]).
>> > > >
>> > > > D3. Do we have a real use case for this?
>> > > >
>> > > > [1]
>> > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/flink/blob/b523264ab45d37cd9584a0e8c06f1ef6bd1aaed7/flink-rpc/flink-rpc-akka/src/main/java/org/apache/flink/runtime/rpc/pekko/PekkoUtils.java#L372
>> > > >
>> > > > [2]
>> > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1119%3A+Add+support+for+SSL+auto+reload
>> > > >
>> > > > [3]
>> > > >
>> > > >
>> > >
>> >
>> https://docs.spring.io/spring-boot/reference/features/ssl.html#features.ssl.reloading
>> > > >
>> > > > Regards,
>> > > > Nicolas
>> > > >
>> > > > On Wed, Apr 23, 2025 at 12:14 PM Doğuşcan Namal <
>> > > namal.dogus...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Hi Nicolas, thanks for the FLIP.
>> > > > >
>> > > > > I am fully supportive of the motivation and we should be
>> supporting
>> > > this
>> > > > > feature. Here are couple of comments from my side:
>> > > > >
>> > > > > D1)
>> > > > > Since you shared the implementation details on the FLIP as well, I
>> > > would
>> > > > > like to discuss whether using Java's WatchService is the best
>> choice
>> > > > here.
>> > > > >
>> > > > > I believe that the certificate renewal is not a frequent
>> operation.
>> > > Even
>> > > > > once a day certificate renewals are not realistic(except for test
>> > cases
>> > > > > maybe) but let's assume that this covers up the p99.99 of the use
>> > > cases.
>> > > > I
>> > > > > am confident on this estimation since there hasn't been a request
>> > from
>> > > > the
>> > > > > community for this feature so far, which confirms that people were
>> > okay
>> > > > > with infrequent cluster restarts. Following that it is
>> infrequent, I
>> > > > > believe that spawning up a thread that watches the file
>> modification
>> > > > > operations is not the best use of the limited resources on a
>> cluster.
>> > > > >
>> > > > > There are some known limitations of the WatchService as well such
>> as
>> > > > > receiving multiple modification events for the same occurence [1],
>> > > > > inotify's (WatchService's underlying mechanism in Linux
>> environments)
>> > > > > problems on containerized environments due to remote file systems
>> [2]
>> > > or
>> > > > > receiving events while file content updates are still in progress.
>> > > [3]. I
>> > > > > do not know if these limitations are addressed in the newer
>> versions
>> > > but
>> > > > > regardless of that it is clear that we may face with some ugly
>> edge
>> > > cases
>> > > > > due to that.
>> > > > >
>> > > > > Given these complications, I would recommend just creating a new
>> > > > SSLContext
>> > > > > after a configured duration is expired. We could record the
>> timestamp
>> > > > when
>> > > > > the previous SSLContext is created and update it after a
>> configured
>> > > > > duration is passed. This will be much easier to test and reason
>> about
>> > > > when
>> > > > > it is running on production. This will eliminate the necessity to
>> > > reason
>> > > > > about the file modification operations as well.
>> > > > >
>> > > > > I briefly skimmed through the classes that need to be modified
>> and it
>> > > > > looked feasible for me. Let me know what are your comments on
>> these.
>> > > > >
>> > > > > Note that this is already used in the Kafka world where a new
>> > > SSLContext
>> > > > is
>> > > > > created after 12 hours.
>> > > > >
>> > > > > D2) We could provide a configuration to the user, such as
>> > > > > "security.ssl.internal.keystore.reload.duration" so they could
>> decide
>> > > how
>> > > > > often the new certificates should be loaded.
>> > > > >
>> > > > > D3)
>> > > > > On the other hand, I wonder whether we should also handle
>> supporting
>> > > > > updating the file paths of the truststores and keystores under
>> this
>> > > FLIP
>> > > > as
>> > > > > well. Since the name of the FLIP is "Handle TLS Certificate
>> Renewal"
>> > I
>> > > > > think we could bring that into scope too :)
>> > > > >
>> > > > > Thanks
>> > > > >
>> > > > > [1]
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://stackoverflow.com/questions/16777869/java-7-watchservice-ignoring-multiple-occurrences-of-the-same-event
>> > > > >
>> > > > > [2] https://blog.arkey.fr/2019/09/13/watchservice-and-bind-mount/
>> > > > >
>> > > > > [3]
>> > > >
>> https://surajatreyac.github.io/2014-07-29/reactive_file_handling.html
>> > > > >
>> > > > > [4] See also Platform Dependencies -
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://docs.oracle.com/javase/8/docs/api/index.html?java/nio/file/WatchService.html
>> > > > >
>> > > > >
>> > > > > On Fri, 18 Apr 2025 at 18:25, Gabor Somogyi <
>> > gabor.g.somo...@gmail.com
>> > > >
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Robert,
>> > > > > >
>> > > > > > Since I've added the same feature to the operator I'll take a
>> look
>> > at
>> > > > it.
>> > > > > > Though it won't be lightning fast since I'm having several weeks
>> > off.
>> > > > > >
>> > > > > > Your questions are valid especially considering the fact that
>> this
>> > > > > feature
>> > > > > > touches the hearth of the authentication so this must be rock
>> solid
>> > > > > > in order to avoid grey hair :)
>> > > > > >
>> > > > > > (1) I would vote on a single service which is heavily unit
>> tested
>> > > with
>> > > > > > all the possible combinations including threading.
>> > > > > >
>> > > > > > Some standalone app could be added to really play with it (that
>> > would
>> > > > > help
>> > > > > > review).
>> > > > > > I mean, create X files, start Y threads, and make assertions.
>> > > > > > The reason why I'm suggesting it is the fact that AFAIR the
>> watch
>> > > > service
>> > > > > > is quite sensitive even in single thread. If we could do this
>> in a
>> > > > finite
>> > > > > > time
>> > > > > > consuming unit test then it's even better.
>> > > > > >
>> > > > > > (2) +1 on that name to avoid confusion
>> > > > > >
>> > > > > > (3) I agree that some e2e is must, however this can be easily
>> and
>> > > > deeply
>> > > > > > unit
>> > > > > > tested so that part is also essential. One key test here is when
>> > > > > > certificates
>> > > > > > are not changing then no action must be performed (not to break
>> the
>> > > > whole
>> > > > > > system apart).
>> > > > > >
>> > > > > > Purely personal opinion but such feature developments are slow
>> by
>> > > > nature
>> > > > > > because
>> > > > > > of edge case / stress testing.
>> > > > > >
>> > > > > > BR,
>> > > > > > G
>> > > > > >
>> > > > > >
>> > > > > > On Fri, Apr 18, 2025 at 4:53 PM Robert Metzger <
>> > rmetz...@apache.org>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi Nicolas,
>> > > > > > >
>> > > > > > > This looks like a nice improvement, thanks for the write up.
>> > > > > > > Are you in touch with any committer who's willing to review /
>> > merge
>> > > > > this?
>> > > > > > >
>> > > > > > > Some random questions on the FLIP:
>> > > > > > > (1)  "Each service that depends on TLS certificates will
>> > > initialize a
>> > > > > > > FileSytemWatchService"
>> > > > > > >
>> > > > > > > It seems that there are 4 components using SSL, does this mean
>> > > there
>> > > > > will
>> > > > > > > be 4 additional threads running, watching the same set of
>> files?
>> > > > > > > Wouldn't it be better to introduce a central file watching
>> > service,
>> > > > and
>> > > > > > SSL
>> > > > > > > users can subscribe to updates, to reduce the number of
>> threads?
>> > > > > > > If this makes the whole effort 4x more complicated, I wouldn't
>> > > > consider
>> > > > > > it,
>> > > > > > > but if its roughly the same effort, we should :)
>> > > > > > >
>> > > > > > > (2) "FileSytemWatchService"
>> > > > > > > When I read this name, I was wondering, whether this is
>> somehow
>> > > > related
>> > > > > > to
>> > > > > > > the Flink "FileSystem" classes. Which I think its' not.
>> > > > > > > Maybe a different name, that makes this separation more
>> explicit,
>> > > > would
>> > > > > > > make sense. Maybe "LocalFSWatchService"?
>> > > > > > > (I'm sorry to bring up naming stuff -- its very subjective,
>> and
>> > > > > > difficult)
>> > > > > > >
>> > > > > > > (3) For the test plan: There seem to be some SSL related e2e
>> > tests:
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/common_ssl.sh
>> > > > > > > It would be nice to extend them to cover this feature as
>> well. I
>> > > > would
>> > > > > > hate
>> > > > > > > for this feature to slowly break by future changes, so good
>> e2e
>> > > test
>> > > > > > > coverage is key, in particular bc so many components are
>> > involved.
>> > > > > > >
>> > > > > > > Best,
>> > > > > > > Robert
>> > > > > > >
>> > > > > > > On Wed, Apr 16, 2025 at 11:55 AM Nicolas Fraison
>> > > > > > > <nicolas.frai...@datadoghq.com.invalid> wrote:
>> > > > > > >
>> > > > > > > > Hi All,
>> > > > > > > >
>> > > > > > > > I'd like to start a discussion to Handle TLS Certificate
>> > Renewal
>> > > > > > > > Please provide some feedback on this proposal:
>> > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-523%3A+Handle+TLS+Certificate+Renewal
>> > > > > > > >
>> > > > > > > > Regards,
>> > > > > > > >
>> > > > > > > > Nicolas Fraison
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: [DISCUSS] FLIP-523 - Handle TLS Certificate Renewal

Reply via email to