Thanks for the update, Gabor. I'll take a look and respond in the document.

Cheers,
Till

On Wed, Jun 9, 2021 at 12:59 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

> Hi Till,
>
> Your proxy suggestion has been considered in depth and the FLIP has been updated accordingly.
> We've considered 2 proxy implementations (Nginx and Squid), but according to our analysis and testing they are not suitable for the mentioned use cases.
> Please take a look at the rejected alternatives for a detailed explanation.
>
> Thanks for your time in advance!
>
> BR,
> G
>
> On Fri, Jun 4, 2021 at 3:31 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> As I've said, I am not a security expert and that's why I have to ask for clarification, Gabor. You are saying that if we configure a truststore for the REST endpoint with a single trusted certificate which has been generated by the operator of the Flink cluster, then the attacker can generate a new certificate, sign it and then talk to the Flink cluster if he has access to the node on which the REST endpoint runs? My understanding was that you need the corresponding private key, which in my proposed setup would be under the control of the operator as well (e.g. stored in a keystore on the same machine but guarded by some secret). That way (if I am not mistaken), only the entity which has access to the keystore is able to talk to the Flink cluster.
>>
>> Maybe we are also getting our wires crossed here and are talking about different things.
>>
>> Thanks for listing the pros and cons of Kerberos. Concerning what other authentication mechanisms are used in the industry, I am not 100% sure.
>>
>> Cheers,
>> Till
>>
>> On Fri, Jun 4, 2021 at 11:09 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>>
>>> > I did not mean for the user to sign its own certificates but for the operator of the cluster. Once the user request hits the proxy, it should no longer be under his control. I think I do not fully understand yet why this would not work.
>>> I said it's not solving the authentication problem over any proxy. Even if the operator is signing the certificate, one can have access to an internal node.
>>> In such a case anybody can craft certificates which are accepted by the server. Once accepted, a bad guy can cancel jobs, causing huge impact.
>>>
>>> > Also, I am missing a bit the comparison of Kerberos to other authentication mechanisms and why they were rejected in favour of Kerberos.
>>> PROS:
>>> * It does not depend on the cloud provider or on a k8s, bare-metal etc. deployment, which is the biggest plus
>>> * Centralized with tools and no need to write tons of tools around
>>> * There are clients/tools on almost all OS-es and several languages
>>> * Very large users have been using it for years in production w/o huge issues
>>> * Provides cross-realm trust possibility amongst other features
>>> * Several open source components use it, which could increase compatibility
>>>
>>> CONS:
>>> * Not everybody is using Kerberos
>>> * It would increase the code footprint but this is true for many features (as a side note, I'm here to maintain it)
>>>
>>> Feel free to add your points because it only represents a single viewpoint.
>>> Also if you have any better option for strong authentication please share it and we can consider the pros/cons here.
>>>
>>> BR,
>>> G
>>>
>>> On Fri, Jun 4, 2021 at 10:32 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>
>>>> I did not mean for the user to sign its own certificates but for the operator of the cluster. Once the user request hits the proxy, it should no longer be under his control. I think I do not fully understand yet why this would not work.
>>>>
>>>> What I would like to avoid is to add more complexity into Flink if there is an easy solution which fulfills the requirements. That's why I would like to exercise thoroughly through the different alternatives.
>>>> Also, I am missing a bit the comparison of Kerberos to other authentication mechanisms and why they were rejected in favour of Kerberos.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Fri, Jun 4, 2021 at 10:26 AM Gyula Fóra <gyf...@apache.org> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> I think there might be possible alternatives but it seems Kerberos on the REST endpoint ticks all the right boxes and provides a super clean and simple solution for strong authentication.
>>>>>
>>>>> I wouldn’t even consider sidecar proxies etc. if we can solve it in such a simple way as proposed by G.
>>>>>
>>>>> Cheers
>>>>> Gyula
>>>>>
>>>>> On Fri, 4 Jun 2021 at 10:03, Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>
>>>>>> I am not saying that we shouldn't add a strong authentication mechanism if there are good reasons for it. I primarily would like to understand the context a bit better in order to give qualified feedback and come to a good decision. In order to do this, I have the feeling that we haven't fully considered all available options which are on the table, tbh.
>>>>>>
>>>>>> Does the problem of certificate expiry also apply for self-signed certificates? If yes, then this should also be a problem for the internal encryption of Flink's communication. If not, then one could use self-signed certificates with a longer validity to solve the mentioned issue.
>>>>>>
>>>>>> I think you can set up Flink in such a way that you don't have to handle all the different certificates. For example, you could deploy Flink with a "sidecar proxy" which is responsible for the authentication using an arbitrary method (e.g. Kerberos) and then bind the REST endpoint to a local network interface. That way, the REST endpoint would only be available through the sidecar proxy. Additionally, one could enable SSL for this communication.
>>>>>> Would this be a solution for the problem?
>>>>>>
>>>>>> Cheers,
>>>>>> Till
>>>>>>
>>>>>> On Thu, Jun 3, 2021 at 10:46 PM Márton Balassi <balassi.mar...@gmail.com> wrote:
>>>>>>
>>>>>>> That is an interesting idea, Till.
>>>>>>>
>>>>>>> The main issue with it is that TLS certificates have an expiration time; usually they get approved for a couple of years. Forcing our users to restart jobs to reprovision TLS certificates would be weird when we could just implement a single proper strong authentication mechanism instead in a couple hundred lines of code. :-)
>>>>>>>
>>>>>>> In many cases it is also impractical to go the TLS mutual route, because the Flink Dashboard can end up on any node in the k8s/Yarn cluster, which means that we need a certificate per node (due to the mutual auth), but if we also want to protect the private keys of these from users accidentally or intentionally leaking them, then we need this per user. As in, we end up managing user*machine number of certificates and having to renew them periodically, which albeit automatable is unfortunately not yet automated in all large organizations.
>>>>>>>
>>>>>>> I fully agree that TLS certificate mutual authentication has its nice properties, especially at very large (multiple thousand node) clusters - but it has its own challenges too. Thanks for bringing it up.
>>>>>>>
>>>>>>> Happy to have this added to the rejected alternative list so that we have the full picture documented.
>>>>>>>
>>>>>>> On Thu, Jun 3, 2021 at 5:52 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>>
>>>>>>>> I guess the idea would then be to let the proxy do the authentication job and only forward the request via an SSL mutually encrypted connection to the Flink cluster. Would this be possible?
>>>>>>>> The beauty of this setup is in my opinion that it should work with all kinds of authentication mechanisms.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>>
>>>>>>>> On Thu, Jun 3, 2021 at 3:12 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for giving options to fulfil the need.
>>>>>>>>>
>>>>>>>>> Users are looking for a solution where users can be identified on the whole cluster and access to resources/actions can be restricted.
>>>>>>>>> A good example of such an action is cancelling other users' running jobs.
>>>>>>>>>
>>>>>>>>> * SSL does provide mutual authentication, but once authentication has passed there is no user on which restrictions can be based.
>>>>>>>>> * The less problematic part is that generating/maintaining short-lived certificates would be hard (that's the reason KDC-like servers exist).
>>>>>>>>> Having long-lived certificates would widen the attack surface, but since the first concern is there this is just a cosmetic issue.
>>>>>>>>>
>>>>>>>>> All in all, using TLS certificates is not sufficient in these environments unfortunately.
>>>>>>>>>
>>>>>>>>> BR,
>>>>>>>>> G
>>>>>>>>>
>>>>>>>>> On Thu, Jun 3, 2021 at 12:49 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the information Gabor. If it is about securing the communication between the REST client and the REST server, then Flink already supports enabling mutual SSL authentication [1]. Would this be enough to secure the communication and to pass an audit?
>>>>>>>>>>
>>>>>>>>>> [1] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/security/security-ssl/#external--rest-connectivity
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Till
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 3, 2021 at 10:33 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Till,
>>>>>>>>>>>
>>>>>>>>>>> Since I've been working in the security area for 10+ years, let me share my thoughts.
>>>>>>>>>>> I would like to emphasise there are experts better than me but I have some basics.
>>>>>>>>>>> The discussion is open and I'm not trying to decide things alone...
>>>>>>>>>>>
>>>>>>>>>>> > I mean if an attacker can get access to one of the machines, then it should also be possible to obtain the right Kerberos token.
>>>>>>>>>>> Not necessarily. For example, if one gets access to a specific user's credentials then it's not possible to compromise other users' jobs, data, etc...
>>>>>>>>>>> Security is like an onion: the more layers have been added, the more time an attacker needs to proceed.
>>>>>>>>>>> At the end of the day, if one is in, then one can most probably find a way, but this time is normally enough for sysadmins or security experts to close down the system and minimize the damage.
>>>>>>>>>>>
>>>>>>>>>>> The other thing is that all tokens have a timeout, and if the token is invalid then the attacker can't proceed further.
>>>>>>>>>>>
>>>>>>>>>>> > Is Kerberos also the standard authentication protocol for Kubernetes deployments?
>>>>>>>>>>> Kerberos is an industry standard which is cloud/deployment agnostic and it can be used in any deployment, including k8s.
>>>>>>>>>>> The main intention is to use Kerberos in k8s deployments too since we're going in this direction as well.
>>>>>>>>>>> Please see how Spark does this:
>>>>>>>>>>> https://spark.apache.org/docs/latest/security.html#secure-interaction-with-kubernetes
>>>>>>>>>>>
>>>>>>>>>>> Last but not least, the most important reason to add at least one strong authentication is that we have users who have hard requirements on this. They're doing security audits and if they fail then it's deal breaking.
>>>>>>>>>>> That is why we have added Kerberos in the first place. Unfortunately we can't name them on this public list; however, the customers who specifically asked for this were mainly in the banking and telco sector.
>>>>>>>>>>>
>>>>>>>>>>> BR,
>>>>>>>>>>> G
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 3, 2021 at 9:20 AM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> > Thanks for updating the document Márton. Why is it that banks will consider it more secure if Flink comes with Kerberos authentication (assuming a properly secured setup)? I mean if an attacker can get access to one of the machines, then it should also be possible to obtain the right Kerberos token.
>>>>>>>>>>> >
>>>>>>>>>>> > I am not an authentication expert and that's why I wanted to ask what other authentication protocols there are besides Kerberos. Why did we select Kerberos and not any other authentication protocol? Maybe you can list the pros and cons for the different protocols. Is Kerberos also the standard authentication protocol for Kubernetes deployments?
>>>>>>>>>>> > If not, what would be the answer when deploying on K8s?
>>>>>>>>>>> >
>>>>>>>>>>> > Cheers,
>>>>>>>>>>> > Till
>>>>>>>>>>> >
>>>>>>>>>>> > On Wed, Jun 2, 2021 at 12:07 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> >> Hi team,
>>>>>>>>>>> >>
>>>>>>>>>>> >> Happy to be here and hope I can provide quality additions in the future.
>>>>>>>>>>> >>
>>>>>>>>>>> >> Thank you all for the helpful suggestions!
>>>>>>>>>>> >> Considering them, the FLIP has been modified and the work continues on the already existing Jira.
>>>>>>>>>>> >>
>>>>>>>>>>> >> BR,
>>>>>>>>>>> >> G
>>>>>>>>>>> >>
>>>>>>>>>>> >> On Wed, Jun 2, 2021 at 11:23 AM Márton Balassi <balassi.mar...@gmail.com> wrote:
>>>>>>>>>>> >>
>>>>>>>>>>> >>> Thanks, Chesnay - I totally missed that. Answered on the ticket too, let us continue there then.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Till, I agree that we should keep this codepath as slim as possible. It is an important design decision that we aim to keep the list of authentication protocols to a minimum. We believe that this should not be a primary concern of Flink and a trusted proxy service (for example Apache Knox) should be used to enable a multitude of end-user authentication mechanisms. The bare minimum of authentication mechanisms to support consequently consists of a single strong authentication protocol, for which Kerberos is the enterprise solution, and HTTP Basic, primarily for development and light-weight scenarios.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> Added the above wording to G's doc.
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit
>>>>>>>>>>> >>>
>>>>>>>>>>> >>> On Tue, Jun 1, 2021 at 11:47 AM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>>>>> >>>
>>>>>>>>>>> >>>> There's a related effort:
>>>>>>>>>>> >>>> https://issues.apache.org/jira/browse/FLINK-21108
>>>>>>>>>>> >>>>
>>>>>>>>>>> >>>> On 6/1/2021 10:14 AM, Till Rohrmann wrote:
>>>>>>>>>>> >>>> > Hi Gabor, welcome to the Flink community!
>>>>>>>>>>> >>>> >
>>>>>>>>>>> >>>> > Thanks for sharing this proposal with the community Márton. In general, I agree that authentication is missing and that this is required for using Flink within an enterprise. The thing I am wondering is whether this feature strictly needs to be implemented inside of Flink or whether a proxy setup could do the job? Have you considered this option? If yes, then it would be good to list it under the point of rejected alternatives.
>>>>>>>>>>> >>>> >
>>>>>>>>>>> >>>> > I do see the benefit of implementing this feature inside of Flink if many users need it. If not, then it might be easier for the project to not increase the surface area since it makes the overall maintenance harder.
>>>>>>>>>>> >>>> >
>>>>>>>>>>> >>>> > Cheers,
>>>>>>>>>>> >>>> > Till
>>>>>>>>>>> >>>> >
>>>>>>>>>>> >>>> > On Mon, May 31, 2021 at 4:57 PM Márton Balassi <mbala...@apache.org> wrote:
>>>>>>>>>>> >>>> >
>>>>>>>>>>> >>>> >> Hi team,
>>>>>>>>>>> >>>> >>
>>>>>>>>>>> >>>> >> Firstly I would like to introduce Gabor, or G [1] for short, to the community. He is a Spark committer who has recently transitioned to the Flink Engineering team at Cloudera and is looking forward to contributing to Apache Flink. Previously G primarily focused on Spark Streaming and security.
>>>>>>>>>>> >>>> >>
>>>>>>>>>>> >>>> >> Based on requests from our customers, G has implemented Kerberos and HTTP Basic Authentication for the Flink Dashboard and HistoryServer. Previously these lacked an authentication story.
>>>>>>>>>>> >>>> >>
>>>>>>>>>>> >>>> >> We are looking to contribute this functionality back to the community; we believe that given Flink's maturity there should be a common code solution for this general pattern.
>>>>>>>>>>> >>>> >>
>>>>>>>>>>> >>>> >> We are looking forward to your feedback on G's design. [2]
>>>>>>>>>>> >>>> >>
>>>>>>>>>>> >>>> >> [1] http://gaborsomogyi.com/
>>>>>>>>>>> >>>> >> [2] https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit