Hi, It indeed make sense, FLIP has been updated
Nicolas On Wed, Jun 4, 2025 at 12:10 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote: > Hi, > > I've read it through. Basically looks good with one comment. > *registerWatchedPath function* has a *Callable* as callback. For the > current implementation that would be enough, > but when somebody would like to use that for any other use-case then it > would be hard. Examples: > * A single *call* function tells the user nothing what kind of event is > this > * the watch service supports 3 events (create, modify, delete), now when I > register I can get only updates (I presume) > * 1+ dir watch with the same callback is not possible > > My suggestion would be to create a proper callback with all the event type > functions and no-op default behavior with the following names[1]. > This is ~30 lines addition but will increase the readability heavily + no > need to touch this code later. > > BR, > G > > [1] > > https://github.com/apache/flink-kubernetes-operator/commit/e5a325c48965a50d61d0aa29e61ba79e97f27082#diff-a30b3ed9b8c53e998b15d7da7ad2e54374c98ffc3c920f76a70bce3fb37a9b2eR87-R93 > > > On Tue, Jun 3, 2025 at 8:53 AM Nicolas Fraison > <nicolas.frai...@datadoghq.com.invalid> wrote: > > > Hi, > > > > The FLIP has been updated. > > Let me know if you have some other comments. > > > > Nicolas > > > > On Tue, May 20, 2025 at 12:01 PM Nicolas Fraison < > > nicolas.frai...@datadoghq.com> wrote: > > > > > Hi Gabor, > > > > > > Thanks for the feedback. > > > The overall proposal with atomic dirty flag being set by the callback > > will > > > indeed work with any kind of implementation. > > > > > > Will see to update the FLIP in a week or 2 if there are no other > > comments > > > on it > > > > > > Nicolas > > > > > > > > > On Tue, May 20, 2025 at 11:08 AM Gabor Somogyi < > > gabor.g.somo...@gmail.com> > > > wrote: > > > > > >> Hi Nicolas, > > >> > > >> Related SSLContext I've not gone through all the cases where we need > to > > >> reload so instead I'm sharing the concept. > > >> The main intention is that we must control all execution paths which > > >> decide > > >> which certificate used for authentication. > > >> Creating an SSLContext decorator which checks reload first and then > > >> forwards all calls to the original (wrapped) context > > >> is one way to achieve that. If there are different implementations > which > > >> end up in similar behavior then it's fine. > > >> > > >> BR, > > >> G > > >> > > >> > > >> On Tue, May 20, 2025 at 10:15 AM Nicolas Fraison > > >> <nicolas.frai...@datadoghq.com.invalid> wrote: > > >> > > >> > Hi, > > >> > > > >> > Overall your proposal looks great. > > >> > The event handling must indeed be super fast and we must also not > > change > > >> > original code path if reload not needed > > >> > > > >> > I have some concerns around the ReloadableSSLContext implementing > all > > >> > SSLContext > > >> > Do you really mean SSLContext (java SSLContext one) or do you refer > to > > >> the > > >> > SslContext from netty? > > >> > > > >> > If java SSLContext > > >> > - I'm not sure how this will manage reload from the BlobServer > > >> > BlobServer relies on creation of an SSLServerSocketFactory from the > > >> > SSLContext. > > >> > But from my current understanding the SSLServerSocketFactory does > not > > >> have > > >> > any connection with the SSLContext. > > >> > It only has some with an SSLContextImpl extends SSLContextSpi. > > >> > I think we will need a callback here to enforce recreation of the > > >> > BlobServer socket. > > >> > - I'm also don't see how to attach this to the SslContext from netty > > >> > > > >> > If netty SslContext > > >> > - we would still need to have the callback to recreate the > BlobServer > > >> > socket > > >> > - for netty and pekko we should be able to rely on it > > >> > > > >> > Nicolas > > >> > > > >> > On Wed, May 7, 2025 at 12:40 PM Gabor Somogyi < > > >> gabor.g.somo...@gmail.com> > > >> > wrote: > > >> > > > >> > > Hi All, > > >> > > > > >> > > I've read through the concerns/proposals related the watch service > > and > > >> > here > > >> > > are my conclusions: > > >> > > * Watch service can watch local file systems only: It's fair to > say > > >> that > > >> > > certificates must be copied to local FS in order to work (init > > >> container > > >> > cp > > >> > > command or something) > > >> > > * Watch service can send multiple events even for a single > directory > > >> > > change: this can be mitigated with a single atomic dirty flag (see > > my > > >> > > design suggestion) > > >> > > * Polling file modification time: When I hear any kind of polling > > >> > > implemented by us is just something I'm mostly opposing > > >> > > * security.ssl.internal.keystore.reload.duration: Security > features > > >> must > > >> > be > > >> > > executed nearly immediately no matter what > > >> > > > > >> > > As a general remark, not sure how many users are using the watch > > >> service > > >> > > based approach in the operator but until now I've not seen any > issue > > >> > > related to that. > > >> > > If somebody is having some specifics then please share. > > >> > > > > >> > > For now I would shoot for keystore not to have feature creep. When > > >> there > > >> > is > > >> > > a valid use-case, community interest and the keystore story is > > already > > >> > rock > > >> > > stable > > >> > > then we can consider to involve truststore later. > > >> > > > > >> > > Having the mentioned assumption that the operator approach works, > > >> here is > > >> > > my high level proposal: > > >> > > * Let's have enable flag for all such watch functionality. If it's > > >> false > > >> > > then the currently existing functionality must remain as-is > > >> > > * Let's have a LocalFSWatchService which is a singleton which has > no > > >> path > > >> > > registrations by default > > >> > > * Add path registration functionality which is synchronised where > a > > >> > > callback can be registered > > >> > > * Let's have a ReloadableSSLContext which implements the mentioned > > >> > callback > > >> > > * Inside the callback set an atomic dirty flag only (this can > handle > > >> > > multiple events for the same directory change + event handling in > > the > > >> > watch > > >> > > service must be extreme fast) > > >> > > * Inside ReloadableSSLContext all SSLContext actions must be > > >> overridden. > > >> > At > > >> > > the beginning of each function dirty flag must be checked > > >> > > and if dirty then certificates must be reloaded, flag can be set > > back > > >> to > > >> > > false (then original functionality call). It's extremely important > > >> that > > >> > > context reload must be synchronised. > > >> > > A synchronised boolean check + possible context reload can consume > > >> some > > >> > > time but I wouldn't expect any significant performance drop. > > >> > > > > >> > > My proposal main drivers: > > >> > > * Original code path must run when no watch service asked > > >> > > * Super fast event handling because million events may come in > (not > > >> > > expecting but we should be prepared) > > >> > > * Clean separation between dir/file watch and file/dir usage > > >> > > * Well considered synchronisation model > > >> > > * Extensive unit testing because we're intended to touch the heart > > of > > >> > Flink > > >> > > > > >> > > Happy to hear other opinions. > > >> > > > > >> > > BR, > > >> > > G > > >> > > > > >> > > > > >> > > On Mon, Apr 28, 2025 at 1:54 PM Nicolas Fraison > > >> > > <nicolas.frai...@datadoghq.com.invalid> wrote: > > >> > > > > >> > > > Thanks all for your feedback and sorry for the late answer (I > was > > on > > >> > > > holiday). > > >> > > > > > >> > > > 1. Indeed it would add 4 threads. > > >> > > > For the non pekko component we can indeed have one watcher > service > > >> used > > >> > > to > > >> > > > reload SSLContext for those components > > >> > > > For pekko this is a little more challenging as the creation of > the > > >> > pekko > > >> > > > ssl engine is managed by pekko himself. > > >> > > > Flink only generates appropriate config with class to execute to > > >> > initiate > > >> > > > the pekko ssl engine [1]. This means that I will not be able to > > >> provide > > >> > > the > > >> > > > watcher service to this ssl engine. > > >> > > > One solution would be to rely on a singleton instead of a > service > > >> > > injected > > >> > > > in each component but I'm not sure this is fine to use such in > > >> flink. > > >> > > > WDYT? > > >> > > > > > >> > > > We can also add a specific flink configuration > > >> > > > (security.ssl.internal.keystore.reload.enable) to only add this > > >> watcher > > >> > > > mechanism if the config is enabled to avoid adding those threads > > if > > >> > this > > >> > > is > > >> > > > not needed. > > >> > > > > > >> > > > > > >> > > > 2. I'm fine with the LocalFSWatchService naming. > > >> > > > > > >> > > > 3. Also agree that some e2e tests must be added. > > >> > > > > > >> > > > > > >> > > > @doguscan namal, Thanks for challenging the proposal. > > >> > > > FYI, we are planning to rely on certificates with really short > > >> > validity. > > >> > > > > > >> > > > D1. It seems that the proposal to rely on a reload period will > > still > > >> > face > > >> > > > potential issues: > > >> > > > - receiving events while file content updates are still in > > progress: > > >> > > > There is no guarantee that we will not load the certificate > while > > >> file > > >> > > > content updates are still in progress > > >> > > > - while it will not be affected by multiple notifications, we > can > > >> > reach a > > >> > > > point where only the truststore is updated when the reload > > happens. > > >> > > > Which means that if the keystore is updated just after, it will > > not > > >> be > > >> > > > taken in account before next run of the reload mechanism > > >> > > > > > >> > > > I think that with WatchService and appropriate reload grace > period > > >> > > > mechanism we should be able to mitigate those 2 issues (ensuring > > >> > minimum > > >> > > > reload even with multiple notify) > > >> > > > > > >> > > > From KIP-1119 [2] it looks like the same kind of requirements is > > >> under > > >> > > > discussion for Kafka to also rely on the WatchService Java API > > >> > > (SpringBoot > > >> > > > seems to also rely on this API to manage ssl reload [3]). > > >> > > > > > >> > > > D3. Do we have a real use case for this? > > >> > > > > > >> > > > [1] > > >> > > > > > >> > > > > > >> > > > > >> > > > >> > > > https://github.com/apache/flink/blob/b523264ab45d37cd9584a0e8c06f1ef6bd1aaed7/flink-rpc/flink-rpc-akka/src/main/java/org/apache/flink/runtime/rpc/pekko/PekkoUtils.java#L372 > > >> > > > > > >> > > > [2] > > >> > > > > > >> > > > > > >> > > > > >> > > > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1119%3A+Add+support+for+SSL+auto+reload > > >> > > > > > >> > > > [3] > > >> > > > > > >> > > > > > >> > > > > >> > > > >> > > > https://docs.spring.io/spring-boot/reference/features/ssl.html#features.ssl.reloading > > >> > > > > > >> > > > Regards, > > >> > > > Nicolas > > >> > > > > > >> > > > On Wed, Apr 23, 2025 at 12:14 PM Doğuşcan Namal < > > >> > > namal.dogus...@gmail.com> > > >> > > > wrote: > > >> > > > > > >> > > > > Hi Nicolas, thanks for the FLIP. > > >> > > > > > > >> > > > > I am fully supportive of the motivation and we should be > > >> supporting > > >> > > this > > >> > > > > feature. Here are couple of comments from my side: > > >> > > > > > > >> > > > > D1) > > >> > > > > Since you shared the implementation details on the FLIP as > > well, I > > >> > > would > > >> > > > > like to discuss whether using Java's WatchService is the best > > >> choice > > >> > > > here. > > >> > > > > > > >> > > > > I believe that the certificate renewal is not a frequent > > >> operation. > > >> > > Even > > >> > > > > once a day certificate renewals are not realistic(except for > > test > > >> > cases > > >> > > > > maybe) but let's assume that this covers up the p99.99 of the > > use > > >> > > cases. > > >> > > > I > > >> > > > > am confident on this estimation since there hasn't been a > > request > > >> > from > > >> > > > the > > >> > > > > community for this feature so far, which confirms that people > > were > > >> > okay > > >> > > > > with infrequent cluster restarts. Following that it is > > >> infrequent, I > > >> > > > > believe that spawning up a thread that watches the file > > >> modification > > >> > > > > operations is not the best use of the limited resources on a > > >> cluster. > > >> > > > > > > >> > > > > There are some known limitations of the WatchService as well > > such > > >> as > > >> > > > > receiving multiple modification events for the same occurence > > [1], > > >> > > > > inotify's (WatchService's underlying mechanism in Linux > > >> environments) > > >> > > > > problems on containerized environments due to remote file > > systems > > >> [2] > > >> > > or > > >> > > > > receiving events while file content updates are still in > > progress. > > >> > > [3]. I > > >> > > > > do not know if these limitations are addressed in the newer > > >> versions > > >> > > but > > >> > > > > regardless of that it is clear that we may face with some ugly > > >> edge > > >> > > cases > > >> > > > > due to that. > > >> > > > > > > >> > > > > Given these complications, I would recommend just creating a > new > > >> > > > SSLContext > > >> > > > > after a configured duration is expired. We could record the > > >> timestamp > > >> > > > when > > >> > > > > the previous SSLContext is created and update it after a > > >> configured > > >> > > > > duration is passed. This will be much easier to test and > reason > > >> about > > >> > > > when > > >> > > > > it is running on production. This will eliminate the necessity > > to > > >> > > reason > > >> > > > > about the file modification operations as well. > > >> > > > > > > >> > > > > I briefly skimmed through the classes that need to be modified > > >> and it > > >> > > > > looked feasible for me. Let me know what are your comments on > > >> these. > > >> > > > > > > >> > > > > Note that this is already used in the Kafka world where a new > > >> > > SSLContext > > >> > > > is > > >> > > > > created after 12 hours. > > >> > > > > > > >> > > > > D2) We could provide a configuration to the user, such as > > >> > > > > "security.ssl.internal.keystore.reload.duration" so they could > > >> decide > > >> > > how > > >> > > > > often the new certificates should be loaded. > > >> > > > > > > >> > > > > D3) > > >> > > > > On the other hand, I wonder whether we should also handle > > >> supporting > > >> > > > > updating the file paths of the truststores and keystores under > > >> this > > >> > > FLIP > > >> > > > as > > >> > > > > well. Since the name of the FLIP is "Handle TLS Certificate > > >> Renewal" > > >> > I > > >> > > > > think we could bring that into scope too :) > > >> > > > > > > >> > > > > Thanks > > >> > > > > > > >> > > > > [1] > > >> > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > > https://stackoverflow.com/questions/16777869/java-7-watchservice-ignoring-multiple-occurrences-of-the-same-event > > >> > > > > > > >> > > > > [2] > > https://blog.arkey.fr/2019/09/13/watchservice-and-bind-mount/ > > >> > > > > > > >> > > > > [3] > > >> > > > > > >> https://surajatreyac.github.io/2014-07-29/reactive_file_handling.html > > >> > > > > > > >> > > > > [4] See also Platform Dependencies - > > >> > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > > https://docs.oracle.com/javase/8/docs/api/index.html?java/nio/file/WatchService.html > > >> > > > > > > >> > > > > > > >> > > > > On Fri, 18 Apr 2025 at 18:25, Gabor Somogyi < > > >> > gabor.g.somo...@gmail.com > > >> > > > > > >> > > > > wrote: > > >> > > > > > > >> > > > > > Hi Robert, > > >> > > > > > > > >> > > > > > Since I've added the same feature to the operator I'll take > a > > >> look > > >> > at > > >> > > > it. > > >> > > > > > Though it won't be lightning fast since I'm having several > > weeks > > >> > off. > > >> > > > > > > > >> > > > > > Your questions are valid especially considering the fact > that > > >> this > > >> > > > > feature > > >> > > > > > touches the hearth of the authentication so this must be > rock > > >> solid > > >> > > > > > in order to avoid grey hair :) > > >> > > > > > > > >> > > > > > (1) I would vote on a single service which is heavily unit > > >> tested > > >> > > with > > >> > > > > > all the possible combinations including threading. > > >> > > > > > > > >> > > > > > Some standalone app could be added to really play with it > > (that > > >> > would > > >> > > > > help > > >> > > > > > review). > > >> > > > > > I mean, create X files, start Y threads, and make > assertions. > > >> > > > > > The reason why I'm suggesting it is the fact that AFAIR the > > >> watch > > >> > > > service > > >> > > > > > is quite sensitive even in single thread. If we could do > this > > >> in a > > >> > > > finite > > >> > > > > > time > > >> > > > > > consuming unit test then it's even better. > > >> > > > > > > > >> > > > > > (2) +1 on that name to avoid confusion > > >> > > > > > > > >> > > > > > (3) I agree that some e2e is must, however this can be > easily > > >> and > > >> > > > deeply > > >> > > > > > unit > > >> > > > > > tested so that part is also essential. One key test here is > > when > > >> > > > > > certificates > > >> > > > > > are not changing then no action must be performed (not to > > break > > >> the > > >> > > > whole > > >> > > > > > system apart). > > >> > > > > > > > >> > > > > > Purely personal opinion but such feature developments are > slow > > >> by > > >> > > > nature > > >> > > > > > because > > >> > > > > > of edge case / stress testing. > > >> > > > > > > > >> > > > > > BR, > > >> > > > > > G > > >> > > > > > > > >> > > > > > > > >> > > > > > On Fri, Apr 18, 2025 at 4:53 PM Robert Metzger < > > >> > rmetz...@apache.org> > > >> > > > > > wrote: > > >> > > > > > > > >> > > > > > > Hi Nicolas, > > >> > > > > > > > > >> > > > > > > This looks like a nice improvement, thanks for the write > up. > > >> > > > > > > Are you in touch with any committer who's willing to > review > > / > > >> > merge > > >> > > > > this? > > >> > > > > > > > > >> > > > > > > Some random questions on the FLIP: > > >> > > > > > > (1) "Each service that depends on TLS certificates will > > >> > > initialize a > > >> > > > > > > FileSytemWatchService" > > >> > > > > > > > > >> > > > > > > It seems that there are 4 components using SSL, does this > > mean > > >> > > there > > >> > > > > will > > >> > > > > > > be 4 additional threads running, watching the same set of > > >> files? > > >> > > > > > > Wouldn't it be better to introduce a central file watching > > >> > service, > > >> > > > and > > >> > > > > > SSL > > >> > > > > > > users can subscribe to updates, to reduce the number of > > >> threads? > > >> > > > > > > If this makes the whole effort 4x more complicated, I > > wouldn't > > >> > > > consider > > >> > > > > > it, > > >> > > > > > > but if its roughly the same effort, we should :) > > >> > > > > > > > > >> > > > > > > (2) "FileSytemWatchService" > > >> > > > > > > When I read this name, I was wondering, whether this is > > >> somehow > > >> > > > related > > >> > > > > > to > > >> > > > > > > the Flink "FileSystem" classes. Which I think its' not. > > >> > > > > > > Maybe a different name, that makes this separation more > > >> explicit, > > >> > > > would > > >> > > > > > > make sense. Maybe "LocalFSWatchService"? > > >> > > > > > > (I'm sorry to bring up naming stuff -- its very > subjective, > > >> and > > >> > > > > > difficult) > > >> > > > > > > > > >> > > > > > > (3) For the test plan: There seem to be some SSL related > e2e > > >> > tests: > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > > https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/common_ssl.sh > > >> > > > > > > It would be nice to extend them to cover this feature as > > >> well. I > > >> > > > would > > >> > > > > > hate > > >> > > > > > > for this feature to slowly break by future changes, so > good > > >> e2e > > >> > > test > > >> > > > > > > coverage is key, in particular bc so many components are > > >> > involved. > > >> > > > > > > > > >> > > > > > > Best, > > >> > > > > > > Robert > > >> > > > > > > > > >> > > > > > > On Wed, Apr 16, 2025 at 11:55 AM Nicolas Fraison > > >> > > > > > > <nicolas.frai...@datadoghq.com.invalid> wrote: > > >> > > > > > > > > >> > > > > > > > Hi All, > > >> > > > > > > > > > >> > > > > > > > I'd like to start a discussion to Handle TLS Certificate > > >> > Renewal > > >> > > > > > > > Please provide some feedback on this proposal: > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-523%3A+Handle+TLS+Certificate+Renewal > > >> > > > > > > > > > >> > > > > > > > Regards, > > >> > > > > > > > > > >> > > > > > > > Nicolas Fraison > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > > > > >