The interface has been added. Do we intend to pass an
instance registerWatchedPath function as we discussed?
If yes then the FLIP needs further adjustments all places where now
Callable provided.

G


On Thu, Jun 5, 2025 at 2:27 PM Nicolas Fraison
<nicolas.frai...@datadoghq.com.invalid> wrote:

> Hi,
>
> It indeed make sense, FLIP has been updated
>
> Nicolas
>
> On Wed, Jun 4, 2025 at 12:10 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I've read it through. Basically looks good with one comment.
> > *registerWatchedPath function* has a *Callable* as callback. For the
> > current implementation that would be enough,
> > but when somebody would like to use that for any other use-case then it
> > would be hard. Examples:
> > * A single *call* function tells the user nothing what kind of event is
> > this
> > * the watch service supports 3 events (create, modify, delete), now when
> I
> > register I can get only updates (I presume)
> > * 1+ dir watch with the same callback is not possible
> >
> > My suggestion would be to create a proper callback with all the event
> type
> > functions and no-op default behavior with the following names[1].
> > This is ~30 lines addition but will increase the readability heavily + no
> > need to touch this code later.
> >
> > BR,
> > G
> >
> > [1]
> >
> >
> https://github.com/apache/flink-kubernetes-operator/commit/e5a325c48965a50d61d0aa29e61ba79e97f27082#diff-a30b3ed9b8c53e998b15d7da7ad2e54374c98ffc3c920f76a70bce3fb37a9b2eR87-R93
> >
> >
> > On Tue, Jun 3, 2025 at 8:53 AM Nicolas Fraison
> > <nicolas.frai...@datadoghq.com.invalid> wrote:
> >
> > > Hi,
> > >
> > > The FLIP has been updated.
> > > Let me know if you have some other comments.
> > >
> > > Nicolas
> > >
> > > On Tue, May 20, 2025 at 12:01 PM Nicolas Fraison <
> > > nicolas.frai...@datadoghq.com> wrote:
> > >
> > > > Hi Gabor,
> > > >
> > > > Thanks for the feedback.
> > > > The overall proposal with atomic dirty flag being set by the callback
> > > will
> > > > indeed work with any kind of implementation.
> > > >
> > > >  Will see to update the FLIP in a week or 2 if there are no other
> > > comments
> > > > on it
> > > >
> > > > Nicolas
> > > >
> > > >
> > > > On Tue, May 20, 2025 at 11:08 AM Gabor Somogyi <
> > > gabor.g.somo...@gmail.com>
> > > > wrote:
> > > >
> > > >> Hi Nicolas,
> > > >>
> > > >> Related SSLContext I've not gone through all the cases where we need
> > to
> > > >> reload so instead I'm sharing the concept.
> > > >> The main intention is that we must control all execution paths which
> > > >> decide
> > > >> which certificate used for authentication.
> > > >> Creating an SSLContext decorator which checks reload first and then
> > > >> forwards all calls to the original (wrapped) context
> > > >> is one way to achieve that. If there are different implementations
> > which
> > > >> end up in similar behavior then it's fine.
> > > >>
> > > >> BR,
> > > >> G
> > > >>
> > > >>
> > > >> On Tue, May 20, 2025 at 10:15 AM Nicolas Fraison
> > > >> <nicolas.frai...@datadoghq.com.invalid> wrote:
> > > >>
> > > >> > Hi,
> > > >> >
> > > >> > Overall your proposal looks great.
> > > >> > The event handling must indeed be super fast and we must also not
> > > change
> > > >> > original code path if reload not needed
> > > >> >
> > > >> > I have some concerns around the ReloadableSSLContext implementing
> > all
> > > >> > SSLContext
> > > >> > Do you really mean SSLContext (java SSLContext one) or do you
> refer
> > to
> > > >> the
> > > >> > SslContext from netty?
> > > >> >
> > > >> > If java SSLContext
> > > >> > - I'm not sure how this will manage reload from the BlobServer
> > > >> > BlobServer relies on creation of an SSLServerSocketFactory from
> the
> > > >> > SSLContext.
> > > >> > But from my current understanding the SSLServerSocketFactory does
> > not
> > > >> have
> > > >> > any connection with the SSLContext.
> > > >> > It only has some with an SSLContextImpl extends SSLContextSpi.
> > > >> > I think we will need a callback here to enforce recreation of the
> > > >> > BlobServer socket.
> > > >> > - I'm also don't see how to attach this to the SslContext from
> netty
> > > >> >
> > > >> > If netty SslContext
> > > >> > - we would still need to have the callback to recreate the
> > BlobServer
> > > >> > socket
> > > >> > - for netty and pekko we should be able to rely on it
> > > >> >
> > > >> > Nicolas
> > > >> >
> > > >> > On Wed, May 7, 2025 at 12:40 PM Gabor Somogyi <
> > > >> gabor.g.somo...@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > > Hi All,
> > > >> > >
> > > >> > > I've read through the concerns/proposals related the watch
> service
> > > and
> > > >> > here
> > > >> > > are my conclusions:
> > > >> > > * Watch service can watch local file systems only: It's fair to
> > say
> > > >> that
> > > >> > > certificates must be copied to local FS in order to work (init
> > > >> container
> > > >> > cp
> > > >> > > command or something)
> > > >> > > * Watch service can send multiple events even for a single
> > directory
> > > >> > > change: this can be mitigated with a single atomic dirty flag
> (see
> > > my
> > > >> > > design suggestion)
> > > >> > > * Polling file modification time: When I hear any kind of
> polling
> > > >> > > implemented by us is just something I'm mostly opposing
> > > >> > > * security.ssl.internal.keystore.reload.duration: Security
> > features
> > > >> must
> > > >> > be
> > > >> > > executed nearly immediately no matter what
> > > >> > >
> > > >> > > As a general remark, not sure how many users are using the watch
> > > >> service
> > > >> > > based approach in the operator but until now I've not seen any
> > issue
> > > >> > > related to that.
> > > >> > > If somebody is having some specifics then please share.
> > > >> > >
> > > >> > > For now I would shoot for keystore not to have feature creep.
> When
> > > >> there
> > > >> > is
> > > >> > > a valid use-case, community interest and the keystore story is
> > > already
> > > >> > rock
> > > >> > > stable
> > > >> > > then we can consider to involve truststore later.
> > > >> > >
> > > >> > > Having the mentioned assumption that the operator approach
> works,
> > > >> here is
> > > >> > > my high level proposal:
> > > >> > > * Let's have enable flag for all such watch functionality. If
> it's
> > > >> false
> > > >> > > then the currently existing functionality must remain as-is
> > > >> > > * Let's have a LocalFSWatchService which is a singleton which
> has
> > no
> > > >> path
> > > >> > > registrations by default
> > > >> > > * Add path registration functionality which is synchronised
> where
> > a
> > > >> > > callback can be registered
> > > >> > > * Let's have a ReloadableSSLContext which implements the
> mentioned
> > > >> > callback
> > > >> > > * Inside the callback set an atomic dirty flag only (this can
> > handle
> > > >> > > multiple events for the same directory change + event handling
> in
> > > the
> > > >> > watch
> > > >> > > service must be extreme fast)
> > > >> > > * Inside ReloadableSSLContext all SSLContext actions must be
> > > >> overridden.
> > > >> > At
> > > >> > > the beginning of each function dirty flag must be checked
> > > >> > > and if dirty then certificates must be reloaded, flag can be set
> > > back
> > > >> to
> > > >> > > false (then original functionality call). It's extremely
> important
> > > >> that
> > > >> > > context reload must be synchronised.
> > > >> > > A synchronised boolean check + possible context reload can
> consume
> > > >> some
> > > >> > > time but I wouldn't expect any significant performance drop.
> > > >> > >
> > > >> > > My proposal main drivers:
> > > >> > > * Original code path must run when no watch service asked
> > > >> > > * Super fast event handling because million events may come in
> > (not
> > > >> > > expecting but we should be prepared)
> > > >> > > * Clean separation between dir/file watch and file/dir usage
> > > >> > > * Well considered synchronisation model
> > > >> > > * Extensive unit testing because we're intended to touch the
> heart
> > > of
> > > >> > Flink
> > > >> > >
> > > >> > > Happy to hear other opinions.
> > > >> > >
> > > >> > > BR,
> > > >> > > G
> > > >> > >
> > > >> > >
> > > >> > > On Mon, Apr 28, 2025 at 1:54 PM Nicolas Fraison
> > > >> > > <nicolas.frai...@datadoghq.com.invalid> wrote:
> > > >> > >
> > > >> > > > Thanks all for your feedback and sorry for the late answer (I
> > was
> > > on
> > > >> > > > holiday).
> > > >> > > >
> > > >> > > > 1. Indeed it would add 4 threads.
> > > >> > > > For the non pekko component we can indeed have one watcher
> > service
> > > >> used
> > > >> > > to
> > > >> > > > reload SSLContext for those components
> > > >> > > > For pekko this is a little more challenging as the creation of
> > the
> > > >> > pekko
> > > >> > > > ssl engine is managed by pekko himself.
> > > >> > > > Flink only generates appropriate config with class to execute
> to
> > > >> > initiate
> > > >> > > > the pekko ssl engine [1]. This means that I will not be able
> to
> > > >> provide
> > > >> > > the
> > > >> > > > watcher service to this ssl engine.
> > > >> > > > One solution would be to rely on a singleton instead of a
> > service
> > > >> > > injected
> > > >> > > > in each component but I'm not sure this is fine to use such in
> > > >> flink.
> > > >> > > > WDYT?
> > > >> > > >
> > > >> > > > We can also add a specific flink configuration
> > > >> > > > (security.ssl.internal.keystore.reload.enable) to only add
> this
> > > >> watcher
> > > >> > > > mechanism if the config is enabled to avoid adding those
> threads
> > > if
> > > >> > this
> > > >> > > is
> > > >> > > > not needed.
> > > >> > > >
> > > >> > > >
> > > >> > > > 2. I'm fine with the LocalFSWatchService naming.
> > > >> > > >
> > > >> > > > 3. Also agree that some e2e tests must be added.
> > > >> > > >
> > > >> > > >
> > > >> > > > @doguscan namal, Thanks for challenging the proposal.
> > > >> > > > FYI, we are planning to rely on certificates with really short
> > > >> > validity.
> > > >> > > >
> > > >> > > > D1. It seems that the proposal to rely on a reload period will
> > > still
> > > >> > face
> > > >> > > > potential issues:
> > > >> > > > - receiving events while file content updates are still in
> > > progress:
> > > >> > > > There is no guarantee that we will not load the certificate
> > while
> > > >> file
> > > >> > > > content updates are still in progress
> > > >> > > > - while it will not be affected by multiple notifications, we
> > can
> > > >> > reach a
> > > >> > > > point where only the truststore is updated when the reload
> > > happens.
> > > >> > > > Which means that if the keystore is updated just after, it
> will
> > > not
> > > >> be
> > > >> > > > taken in account before next run of the reload mechanism
> > > >> > > >
> > > >> > > > I think that with WatchService and appropriate reload grace
> > period
> > > >> > > > mechanism we should be able to mitigate those 2 issues
> (ensuring
> > > >> > minimum
> > > >> > > > reload even with multiple notify)
> > > >> > > >
> > > >> > > > From KIP-1119 [2] it looks like the same kind of requirements
> is
> > > >> under
> > > >> > > > discussion for Kafka to also rely on the WatchService Java API
> > > >> > > (SpringBoot
> > > >> > > > seems to also rely on this API to manage ssl reload [3]).
> > > >> > > >
> > > >> > > > D3. Do we have a real use case for this?
> > > >> > > >
> > > >> > > > [1]
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://github.com/apache/flink/blob/b523264ab45d37cd9584a0e8c06f1ef6bd1aaed7/flink-rpc/flink-rpc-akka/src/main/java/org/apache/flink/runtime/rpc/pekko/PekkoUtils.java#L372
> > > >> > > >
> > > >> > > > [2]
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1119%3A+Add+support+for+SSL+auto+reload
> > > >> > > >
> > > >> > > > [3]
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://docs.spring.io/spring-boot/reference/features/ssl.html#features.ssl.reloading
> > > >> > > >
> > > >> > > > Regards,
> > > >> > > > Nicolas
> > > >> > > >
> > > >> > > > On Wed, Apr 23, 2025 at 12:14 PM Doğuşcan Namal <
> > > >> > > namal.dogus...@gmail.com>
> > > >> > > > wrote:
> > > >> > > >
> > > >> > > > > Hi Nicolas, thanks for the FLIP.
> > > >> > > > >
> > > >> > > > > I am fully supportive of the motivation and we should be
> > > >> supporting
> > > >> > > this
> > > >> > > > > feature. Here are couple of comments from my side:
> > > >> > > > >
> > > >> > > > > D1)
> > > >> > > > > Since you shared the implementation details on the FLIP as
> > > well, I
> > > >> > > would
> > > >> > > > > like to discuss whether using Java's WatchService is the
> best
> > > >> choice
> > > >> > > > here.
> > > >> > > > >
> > > >> > > > > I believe that the certificate renewal is not a frequent
> > > >> operation.
> > > >> > > Even
> > > >> > > > > once a day certificate renewals are not realistic(except for
> > > test
> > > >> > cases
> > > >> > > > > maybe) but let's assume that this covers up the p99.99 of
> the
> > > use
> > > >> > > cases.
> > > >> > > > I
> > > >> > > > > am confident on this estimation since there hasn't been a
> > > request
> > > >> > from
> > > >> > > > the
> > > >> > > > > community for this feature so far, which confirms that
> people
> > > were
> > > >> > okay
> > > >> > > > > with infrequent cluster restarts. Following that it is
> > > >> infrequent, I
> > > >> > > > > believe that spawning up a thread that watches the file
> > > >> modification
> > > >> > > > > operations is not the best use of the limited resources on a
> > > >> cluster.
> > > >> > > > >
> > > >> > > > > There are some known limitations of the WatchService as well
> > > such
> > > >> as
> > > >> > > > > receiving multiple modification events for the same
> occurence
> > > [1],
> > > >> > > > > inotify's (WatchService's underlying mechanism in Linux
> > > >> environments)
> > > >> > > > > problems on containerized environments due to remote file
> > > systems
> > > >> [2]
> > > >> > > or
> > > >> > > > > receiving events while file content updates are still in
> > > progress.
> > > >> > > [3]. I
> > > >> > > > > do not know if these limitations are addressed in the newer
> > > >> versions
> > > >> > > but
> > > >> > > > > regardless of that it is clear that we may face with some
> ugly
> > > >> edge
> > > >> > > cases
> > > >> > > > > due to that.
> > > >> > > > >
> > > >> > > > > Given these complications, I would recommend just creating a
> > new
> > > >> > > > SSLContext
> > > >> > > > > after a configured duration is expired. We could record the
> > > >> timestamp
> > > >> > > > when
> > > >> > > > > the previous SSLContext is created and update it after a
> > > >> configured
> > > >> > > > > duration is passed. This will be much easier to test and
> > reason
> > > >> about
> > > >> > > > when
> > > >> > > > > it is running on production. This will eliminate the
> necessity
> > > to
> > > >> > > reason
> > > >> > > > > about the file modification operations as well.
> > > >> > > > >
> > > >> > > > > I briefly skimmed through the classes that need to be
> modified
> > > >> and it
> > > >> > > > > looked feasible for me. Let me know what are your comments
> on
> > > >> these.
> > > >> > > > >
> > > >> > > > > Note that this is already used in the Kafka world where a
> new
> > > >> > > SSLContext
> > > >> > > > is
> > > >> > > > > created after 12 hours.
> > > >> > > > >
> > > >> > > > > D2) We could provide a configuration to the user, such as
> > > >> > > > > "security.ssl.internal.keystore.reload.duration" so they
> could
> > > >> decide
> > > >> > > how
> > > >> > > > > often the new certificates should be loaded.
> > > >> > > > >
> > > >> > > > > D3)
> > > >> > > > > On the other hand, I wonder whether we should also handle
> > > >> supporting
> > > >> > > > > updating the file paths of the truststores and keystores
> under
> > > >> this
> > > >> > > FLIP
> > > >> > > > as
> > > >> > > > > well. Since the name of the FLIP is "Handle TLS Certificate
> > > >> Renewal"
> > > >> > I
> > > >> > > > > think we could bring that into scope too :)
> > > >> > > > >
> > > >> > > > > Thanks
> > > >> > > > >
> > > >> > > > > [1]
> > > >> > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://stackoverflow.com/questions/16777869/java-7-watchservice-ignoring-multiple-occurrences-of-the-same-event
> > > >> > > > >
> > > >> > > > > [2]
> > > https://blog.arkey.fr/2019/09/13/watchservice-and-bind-mount/
> > > >> > > > >
> > > >> > > > > [3]
> > > >> > > >
> > > >>
> https://surajatreyac.github.io/2014-07-29/reactive_file_handling.html
> > > >> > > > >
> > > >> > > > > [4] See also Platform Dependencies -
> > > >> > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://docs.oracle.com/javase/8/docs/api/index.html?java/nio/file/WatchService.html
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Fri, 18 Apr 2025 at 18:25, Gabor Somogyi <
> > > >> > gabor.g.somo...@gmail.com
> > > >> > > >
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > Hi Robert,
> > > >> > > > > >
> > > >> > > > > > Since I've added the same feature to the operator I'll
> take
> > a
> > > >> look
> > > >> > at
> > > >> > > > it.
> > > >> > > > > > Though it won't be lightning fast since I'm having several
> > > weeks
> > > >> > off.
> > > >> > > > > >
> > > >> > > > > > Your questions are valid especially considering the fact
> > that
> > > >> this
> > > >> > > > > feature
> > > >> > > > > > touches the hearth of the authentication so this must be
> > rock
> > > >> solid
> > > >> > > > > > in order to avoid grey hair :)
> > > >> > > > > >
> > > >> > > > > > (1) I would vote on a single service which is heavily unit
> > > >> tested
> > > >> > > with
> > > >> > > > > > all the possible combinations including threading.
> > > >> > > > > >
> > > >> > > > > > Some standalone app could be added to really play with it
> > > (that
> > > >> > would
> > > >> > > > > help
> > > >> > > > > > review).
> > > >> > > > > > I mean, create X files, start Y threads, and make
> > assertions.
> > > >> > > > > > The reason why I'm suggesting it is the fact that AFAIR
> the
> > > >> watch
> > > >> > > > service
> > > >> > > > > > is quite sensitive even in single thread. If we could do
> > this
> > > >> in a
> > > >> > > > finite
> > > >> > > > > > time
> > > >> > > > > > consuming unit test then it's even better.
> > > >> > > > > >
> > > >> > > > > > (2) +1 on that name to avoid confusion
> > > >> > > > > >
> > > >> > > > > > (3) I agree that some e2e is must, however this can be
> > easily
> > > >> and
> > > >> > > > deeply
> > > >> > > > > > unit
> > > >> > > > > > tested so that part is also essential. One key test here
> is
> > > when
> > > >> > > > > > certificates
> > > >> > > > > > are not changing then no action must be performed (not to
> > > break
> > > >> the
> > > >> > > > whole
> > > >> > > > > > system apart).
> > > >> > > > > >
> > > >> > > > > > Purely personal opinion but such feature developments are
> > slow
> > > >> by
> > > >> > > > nature
> > > >> > > > > > because
> > > >> > > > > > of edge case / stress testing.
> > > >> > > > > >
> > > >> > > > > > BR,
> > > >> > > > > > G
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > On Fri, Apr 18, 2025 at 4:53 PM Robert Metzger <
> > > >> > rmetz...@apache.org>
> > > >> > > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > > Hi Nicolas,
> > > >> > > > > > >
> > > >> > > > > > > This looks like a nice improvement, thanks for the write
> > up.
> > > >> > > > > > > Are you in touch with any committer who's willing to
> > review
> > > /
> > > >> > merge
> > > >> > > > > this?
> > > >> > > > > > >
> > > >> > > > > > > Some random questions on the FLIP:
> > > >> > > > > > > (1)  "Each service that depends on TLS certificates will
> > > >> > > initialize a
> > > >> > > > > > > FileSytemWatchService"
> > > >> > > > > > >
> > > >> > > > > > > It seems that there are 4 components using SSL, does
> this
> > > mean
> > > >> > > there
> > > >> > > > > will
> > > >> > > > > > > be 4 additional threads running, watching the same set
> of
> > > >> files?
> > > >> > > > > > > Wouldn't it be better to introduce a central file
> watching
> > > >> > service,
> > > >> > > > and
> > > >> > > > > > SSL
> > > >> > > > > > > users can subscribe to updates, to reduce the number of
> > > >> threads?
> > > >> > > > > > > If this makes the whole effort 4x more complicated, I
> > > wouldn't
> > > >> > > > consider
> > > >> > > > > > it,
> > > >> > > > > > > but if its roughly the same effort, we should :)
> > > >> > > > > > >
> > > >> > > > > > > (2) "FileSytemWatchService"
> > > >> > > > > > > When I read this name, I was wondering, whether this is
> > > >> somehow
> > > >> > > > related
> > > >> > > > > > to
> > > >> > > > > > > the Flink "FileSystem" classes. Which I think its' not.
> > > >> > > > > > > Maybe a different name, that makes this separation more
> > > >> explicit,
> > > >> > > > would
> > > >> > > > > > > make sense. Maybe "LocalFSWatchService"?
> > > >> > > > > > > (I'm sorry to bring up naming stuff -- its very
> > subjective,
> > > >> and
> > > >> > > > > > difficult)
> > > >> > > > > > >
> > > >> > > > > > > (3) For the test plan: There seem to be some SSL related
> > e2e
> > > >> > tests:
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/common_ssl.sh
> > > >> > > > > > > It would be nice to extend them to cover this feature as
> > > >> well. I
> > > >> > > > would
> > > >> > > > > > hate
> > > >> > > > > > > for this feature to slowly break by future changes, so
> > good
> > > >> e2e
> > > >> > > test
> > > >> > > > > > > coverage is key, in particular bc so many components are
> > > >> > involved.
> > > >> > > > > > >
> > > >> > > > > > > Best,
> > > >> > > > > > > Robert
> > > >> > > > > > >
> > > >> > > > > > > On Wed, Apr 16, 2025 at 11:55 AM Nicolas Fraison
> > > >> > > > > > > <nicolas.frai...@datadoghq.com.invalid> wrote:
> > > >> > > > > > >
> > > >> > > > > > > > Hi All,
> > > >> > > > > > > >
> > > >> > > > > > > > I'd like to start a discussion to Handle TLS
> Certificate
> > > >> > Renewal
> > > >> > > > > > > > Please provide some feedback on this proposal:
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-523%3A+Handle+TLS+Certificate+Renewal
> > > >> > > > > > > >
> > > >> > > > > > > > Regards,
> > > >> > > > > > > >
> > > >> > > > > > > > Nicolas Fraison
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Reply via email to