Hi Nicolas, thanks for the FLIP. I am fully supportive of the motivation and we should be supporting this feature. Here are couple of comments from my side:
D1) Since you shared the implementation details on the FLIP as well, I would like to discuss whether using Java's WatchService is the best choice here. I believe that the certificate renewal is not a frequent operation. Even once a day certificate renewals are not realistic(except for test cases maybe) but let's assume that this covers up the p99.99 of the use cases. I am confident on this estimation since there hasn't been a request from the community for this feature so far, which confirms that people were okay with infrequent cluster restarts. Following that it is infrequent, I believe that spawning up a thread that watches the file modification operations is not the best use of the limited resources on a cluster. There are some known limitations of the WatchService as well such as receiving multiple modification events for the same occurence [1], inotify's (WatchService's underlying mechanism in Linux environments) problems on containerized environments due to remote file systems [2] or receiving events while file content updates are still in progress. [3]. I do not know if these limitations are addressed in the newer versions but regardless of that it is clear that we may face with some ugly edge cases due to that. Given these complications, I would recommend just creating a new SSLContext after a configured duration is expired. We could record the timestamp when the previous SSLContext is created and update it after a configured duration is passed. This will be much easier to test and reason about when it is running on production. This will eliminate the necessity to reason about the file modification operations as well. I briefly skimmed through the classes that need to be modified and it looked feasible for me. Let me know what are your comments on these. Note that this is already used in the Kafka world where a new SSLContext is created after 12 hours. D2) We could provide a configuration to the user, such as "security.ssl.internal.keystore.reload.duration" so they could decide how often the new certificates should be loaded. D3) On the other hand, I wonder whether we should also handle supporting updating the file paths of the truststores and keystores under this FLIP as well. Since the name of the FLIP is "Handle TLS Certificate Renewal" I think we could bring that into scope too :) Thanks [1] https://stackoverflow.com/questions/16777869/java-7-watchservice-ignoring-multiple-occurrences-of-the-same-event [2] https://blog.arkey.fr/2019/09/13/watchservice-and-bind-mount/ [3] https://surajatreyac.github.io/2014-07-29/reactive_file_handling.html [4] See also Platform Dependencies - https://docs.oracle.com/javase/8/docs/api/index.html?java/nio/file/WatchService.html On Fri, 18 Apr 2025 at 18:25, Gabor Somogyi <gabor.g.somo...@gmail.com> wrote: > Hi Robert, > > Since I've added the same feature to the operator I'll take a look at it. > Though it won't be lightning fast since I'm having several weeks off. > > Your questions are valid especially considering the fact that this feature > touches the hearth of the authentication so this must be rock solid > in order to avoid grey hair :) > > (1) I would vote on a single service which is heavily unit tested with > all the possible combinations including threading. > > Some standalone app could be added to really play with it (that would help > review). > I mean, create X files, start Y threads, and make assertions. > The reason why I'm suggesting it is the fact that AFAIR the watch service > is quite sensitive even in single thread. If we could do this in a finite > time > consuming unit test then it's even better. > > (2) +1 on that name to avoid confusion > > (3) I agree that some e2e is must, however this can be easily and deeply > unit > tested so that part is also essential. One key test here is when > certificates > are not changing then no action must be performed (not to break the whole > system apart). > > Purely personal opinion but such feature developments are slow by nature > because > of edge case / stress testing. > > BR, > G > > > On Fri, Apr 18, 2025 at 4:53 PM Robert Metzger <rmetz...@apache.org> > wrote: > > > Hi Nicolas, > > > > This looks like a nice improvement, thanks for the write up. > > Are you in touch with any committer who's willing to review / merge this? > > > > Some random questions on the FLIP: > > (1) "Each service that depends on TLS certificates will initialize a > > FileSytemWatchService" > > > > It seems that there are 4 components using SSL, does this mean there will > > be 4 additional threads running, watching the same set of files? > > Wouldn't it be better to introduce a central file watching service, and > SSL > > users can subscribe to updates, to reduce the number of threads? > > If this makes the whole effort 4x more complicated, I wouldn't consider > it, > > but if its roughly the same effort, we should :) > > > > (2) "FileSytemWatchService" > > When I read this name, I was wondering, whether this is somehow related > to > > the Flink "FileSystem" classes. Which I think its' not. > > Maybe a different name, that makes this separation more explicit, would > > make sense. Maybe "LocalFSWatchService"? > > (I'm sorry to bring up naming stuff -- its very subjective, and > difficult) > > > > (3) For the test plan: There seem to be some SSL related e2e tests: > > > > > https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/common_ssl.sh > > It would be nice to extend them to cover this feature as well. I would > hate > > for this feature to slowly break by future changes, so good e2e test > > coverage is key, in particular bc so many components are involved. > > > > Best, > > Robert > > > > On Wed, Apr 16, 2025 at 11:55 AM Nicolas Fraison > > <nicolas.frai...@datadoghq.com.invalid> wrote: > > > > > Hi All, > > > > > > I'd like to start a discussion to Handle TLS Certificate Renewal > > > Please provide some feedback on this proposal: > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-523%3A+Handle+TLS+Certificate+Renewal > > > > > > Regards, > > > > > > Nicolas Fraison > > > > > >