Hi All, First of all, sorry that I've taken a couple of the mails heavily! I had the impression that, after we had invested roughly 2 months into the FLIP, it was moving toward rejection without an alternative we could work on.
As I said earlier, and it still stands: if there is a better idea for how this could be solved, I'm open to it, even at the price of rejecting this FLIP. What I would ask is that any suggestion, or even a rejection, comes with a concrete proposal we can agree on. During these 2 months I've considered many options, and this is the design/code that requires the fewest lines of code; it has been relatively rock stable in production in another product, and I personally have roughly 3 years of experience with it. The design is not a 1-to-1 copy-paste, because I've taken my limited knowledge of Flink into account. Since I'm not the one with 7+ years within Flink, I can accept that something may not be the way it should be done. Please suggest a better way, and I'm sure we're going to come up with something that makes everybody happy. So I'm waiting on the suggestions, and we'll drive the ship from there...

G

On Fri, Feb 4, 2022 at 12:08 AM Till Rohrmann <trohrm...@apache.org> wrote:

> Sorry I didn't want to offend anybody if it was perceived like this. I can
> see that me joining very late into the discussion w/o constructive ideas
> was not nice. My motivation for asking for the reasoning behind the
> current design proposal is primarily the lack of Kerberos knowledge.
> Moreover, it happened before that we moved responsibilities into Flink
> that we regretted later.
>
> As I've said, I don't have a better idea right now. If we believe that it
> is the right thing to make Flink responsible for distributing the tokens
> and we don't find a better solution then we'll go for it. I just wanted to
> make sure that we don't overlook an alternative solution that might be
> easier to maintain in the long run.
>
> Cheers,
> Till
>
> On Thu, Feb 3, 2022 at 7:52 PM Gyula Fóra <gyula.f...@gmail.com> wrote:
>
> > Hi Team!
> >
> > Let's all calm down a little and not let our emotions affect the
> > discussion too much.
> > There has been a lot of effort spent by all involved parties, so this
> > is quite understandable :)
> >
> > Even though not everyone said this explicitly, it seems that everyone
> > more or less agrees that a feature implementing token renewal is
> > necessary and valuable.
> >
> > The main point of contention is: where should the token renewal logic
> > run, and how do we get the tokens to wherever they are needed?
> >
> > From my perspective the current design is very reasonable at first
> > sight because:
> > 1. It runs the token renewal in a single place, avoiding extra KDC
> > workload
> > 2. It does not introduce new processes, extra communication channels,
> > etc., but piggybacks on existing robust mechanisms.
> >
> > I understand the concerns about adding new things in the resource
> > manager, but I think that really depends on how we look at it.
> > We cannot reasonably expect a custom token renewal process to have its
> > own secure distribution logic like Flink has now; that is complete
> > overkill. This practically means that we would not have a slim,
> > efficient implementation for this, but something unnecessarily
> > complex. And the only thing we get in return is a bit less code in the
> > resource manager.
> >
> > From a logical standpoint, the delegation token framework needs to run
> > in a centralized place and needs to be able to access new task manager
> > processes to achieve all of its design goals.
> > We can drop a single renewer as a design goal, but that might be a
> > decision that affects large-scale production runs.
> >
> > Cheers,
> > Gyula
> >
> >
> > On Thu, Feb 3, 2022 at 7:32 PM Chesnay Schepler <ches...@apache.org>
> > wrote:
> >
> > > First off, at no point have we questioned the use-case and
> > > importance of this feature, and the fact that David, Till and me
> > > spent time looking at the FLIP, asking questions, and discussing
> > > different aspects of it should make this obvious.
> > >
> > > I'd appreciate it if you didn't dismiss our replies that quickly.
> > >
> > > > Ok, so we declare that users who try to use delegation tokens in
> > > > Flink is dead end code and not supported, right?
> > >
> > > No one has said that. Are you claiming that your design is the /only
> > > possible implementation/ that is capable of achieving the stated
> > > goals, that there are 0 alternatives? One of the *main points* of
> > > these discussion threads is to discover alternative implementations
> > > that maybe weren't thought of. Yes, that may imply that we amend
> > > your design, or reject it completely and come up with a new one.
> > >
> > > Let's clarify what (I think) Till proposed, to get the imagination
> > > juice flowing.
> > >
> > > At the end of the day, all we need is a way to provide Flink
> > > processes with a token that can be periodically updated. _Who_
> > > issues that token is irrelevant for the functionality to work. You
> > > are proposing a new component in the Flink RM to do that; Till is
> > > proposing to have some external process do it. *That's it*.
> > >
> > > What this could look like in practice is fairly straightforward: add
> > > a pluggable interface (aka your TokenProvider thing) that is loaded
> > > in each process, which can _somehow_ provide tokens that are then
> > > set in the UserGroupInformation.
> > > _How_ the provider receives tokens is up to the provider. It _may_
> > > just talk directly to Kerberos, or it could use some communication
> > > channel to accept tokens from the outside.
> > > This would for example make it a lot easier to properly integrate
> > > this into the lifecycle of the process, as we'd sidestep the whole
> > > "TM is running but still needs a token" issue; it could become a
> > > proper setup step of the process that is independent from other
> > > Flink processes.
> > >
> > > /Discuss/.
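[Editor's note: the pluggable-provider idea quoted above could be sketched roughly as below. All names (`TokenProvider`, `StubProvider`, `TokenBootstrap`) are hypothetical illustrations, not actual Flink API; a real integration would install the token into Hadoop's `UserGroupInformation` instead of printing it.]

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical sketch of the pluggable interface discussed above: each Flink
 * process loads a provider and fetches a token as a setup step, BEFORE it
 * accepts any work. HOW the provider gets the token (talking to Kerberos
 * directly, or receiving it from an external process) is its own concern.
 */
interface TokenProvider {
    /** Returns an opaque serialized token. */
    byte[] obtainToken();

    /** Suggested delay until the token should be refreshed. */
    long renewalIntervalMillis();
}

/** Stand-in provider; a real one would contact a KDC or an external service. */
class StubProvider implements TokenProvider {
    @Override
    public byte[] obtainToken() {
        return "token-v1".getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public long renewalIntervalMillis() {
        return 60_000L;
    }
}

public class TokenBootstrap {
    public static void main(String[] args) {
        TokenProvider provider = new StubProvider(); // would be ServiceLoader-discovered
        // Setup step of the process, independent from other Flink processes:
        byte[] token = provider.obtainToken();
        System.out.println("installed token: " + new String(token, StandardCharsets.UTF_8));
    }
}
```

With this shape, the "TM is running but still needs a token" problem disappears because `obtainToken()` runs before the process offers slots or accepts tasks.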
> > > > > > On 03/02/2022 18:57, Gabor Somogyi wrote: > > > >> And even > > > > if we do it like this, there is no guarantee that it works because > > there > > > > can be other applications bombing the KDC with requests. > > > > > > > > 1. The main issue to solve here is that workloads using delegation > > tokens > > > > are stopping after 7 days with default configuration. > > > > 2. This is not new design, it's rock stable and performing well in > > Spark > > > > for years. > > > > > > > >> From a > > > > maintainability and separation of concerns perspective I'd rather > have > > > this > > > > as some kind of external tool/service that makes KDC scale better and > > > that > > > > Flink processes can talk to to obtain the tokens. > > > > > > > > Ok, so we declare that users who try to use delegation tokens in > Flink > > is > > > > dead end code and not supported, right? Then this must be explicitely > > > > written in the security documentation that such users who use that > > > feature > > > > are left behind. > > > > > > > > As I see the discussion turned away from facts and started to speak > > about > > > > feelings. If you have strategic problems with the feature please put > > your > > > > -1 on the vote and we can spare quite some time. > > > > > > > > G > > > > > > > > > > > > On Thu, 3 Feb 2022, 18:34 Till Rohrmann,<trohrm...@apache.org> > wrote: > > > > > > > >> I don't have a good alternative solution but it sounds to me a bit > as > > > if we > > > >> are trying to solve Kerberos' scalability problems within Flink. And > > > even > > > >> if we do it like this, there is no guarantee that it works because > > there > > > >> can be other applications bombing the KDC with requests. From a > > > >> maintainability and separation of concerns perspective I'd rather > have > > > this > > > >> as some kind of external tool/service that makes KDC scale better > and > > > that > > > >> Flink processes can talk to to obtain the tokens. 
> > > >> > > > >> Cheers, > > > >> Till > > > >> > > > >> On Thu, Feb 3, 2022 at 6:01 PM Gabor Somogyi< > > gabor.g.somo...@gmail.com> > > > >> wrote: > > > >> > > > >>> Oh and the most important reason I've forgotten. > > > >>> Without the feature in the FLIP all secure workloads with > delegation > > > >> tokens > > > >>> are going to stop when tokens are reaching it's max lifetime 🙂 > > > >>> This is around 7 days with default config... > > > >>> > > > >>> On Thu, Feb 3, 2022 at 5:30 PM Gabor Somogyi< > > gabor.g.somo...@gmail.com > > > > > > > >>> wrote: > > > >>> > > > >>>> That's not the single purpose of the feature but in some > > environments > > > >> it > > > >>>> caused problems. > > > >>>> The main intention is not to deploy keytab to all the nodes > because > > > the > > > >>>> attack surface is bigger + reduce the KDC load. > > > >>>> I've already described the situation previously in this thread so > > > >> copying > > > >>>> it here. > > > >>>> > > > >>>> --------COPY-------- > > > >>>> "KDC *may* collapse under some circumstances" is the proper > wording. > > > >>>> > > > >>>> We have several customers who are executing workloads on > > Spark/Flink. > > > >>> Most > > > >>>> of the time I'm facing their > > > >>>> daily issues which is heavily environment and use-case dependent. 
> > I've > > > >>>> seen various cases: > > > >>>> * where the mentioned ~1k nodes were working fine > > > >>>> * where KDC thought the number of requests are coming from DDOS > > attack > > > >> so > > > >>>> discontinued authentication > > > >>>> * where KDC was simply not responding because of the load > > > >>>> * where KDC was intermittently had some outage (this was the most > > > nasty > > > >>>> thing) > > > >>>> > > > >>>> Since you're managing relatively big cluster then you know that > KDC > > is > > > >>> not > > > >>>> only used by Spark/Flink workloads > > > >>>> but the whole company IT infrastructure is bombing it so it really > > > >>> depends > > > >>>> on other factors too whether KDC is reaching > > > >>>> it's limit or not. Not sure what kind of evidence are you looking > > for > > > >> but > > > >>>> I'm not authorized to share any information about > > > >>>> our clients data. > > > >>>> > > > >>>> One thing is for sure. The more external system types are used in > > > >>>> workloads (for ex. HDFS, HBase, Hive, Kafka) which > > > >>>> are authenticating through KDC the more possibility to reach this > > > >>>> threshold when the cluster is big enough. > > > >>>> --------COPY-------- > > > >>>> > > > >>>>> The FLIP mentions scaling issues with 200 nodes; it's really > > > >> surprising > > > >>>> to me that such a small number of requests can already cause > issues. > > > >>>> > > > >>>> One node/task doesn't mean 1 request. The following type of > kerberos > > > >> auth > > > >>>> types has been seen by me which can run at the same time: > > > >>>> HDFS, Hbase, Hive, Kafka, all DBs (oracle, mariaDB, etc...) > > > >> Additionally > > > >>>> one task is not necessarily opens 1 connection. > > > >>>> > > > >>>> All in all I don't have steps to reproduce but we've faced this > > > >>> already... 
> > > >>>> G > > > >>>> > > > >>>> > > > >>>> On Thu, Feb 3, 2022 at 5:15 PM Chesnay Schepler< > ches...@apache.org> > > > >>>> wrote: > > > >>>> > > > >>>>> What I don't understand is how this could overload the KDC. > Aren't > > > >>>>> tokens valid for a relatively long time period? > > > >>>>> > > > >>>>> For new deployments where many TMs are started at once I could > > > imagine > > > >>>>> it temporarily, but shouldn't the accesses to the KDC eventually > > > >>>>> naturally spread out? > > > >>>>> > > > >>>>> The FLIP mentions scaling issues with 200 nodes; it's really > > > >> surprising > > > >>>>> to me that such a small number of requests can already cause > > issues. > > > >>>>> > > > >>>>> On 03/02/2022 16:14, Gabor Somogyi wrote: > > > >>>>>>> I would prefer not choosing the first option > > > >>>>>> Then the second option may play only. > > > >>>>>> > > > >>>>>>> I am not a Kerberos expert but is it really so that every > > > >> application > > > >>>>> that > > > >>>>>> wants to use Kerberos needs to implement the token propagation > > > >> itself? > > > >>>>> This > > > >>>>>> somehow feels as if there is something missing. > > > >>>>>> > > > >>>>>> OK, so first some kerberos + token intro. > > > >>>>>> > > > >>>>>> Some basics: > > > >>>>>> * TGT can be created from keytab > > > >>>>>> * TGT is needed to obtain TGS (called token) > > > >>>>>> * Authentication only works with TGS -> all places where > external > > > >>>>> system is > > > >>>>>> needed either a TGT or TGS needed > > > >>>>>> > > > >>>>>> There are basically 2 ways to authenticate to a kerberos secured > > > >>>>> external > > > >>>>>> system: > > > >>>>>> 1. One needs a kerberos TGT which MUST be propagated to all > JVMs. > > > >> Here > > > >>>>> each > > > >>>>>> and every JVM obtains a TGS by itself which bombs the KDC that > may > > > >>>>> collapse. > > > >>>>>> 2. One needs a kerberos TGT which exists only on a single place > > (in > > > >>> this > > > >>>>>> case JM). 
JM gets a TGS which MUST be propagated to all TMs > > because > > > >>>>>> otherwise authentication fails. > > > >>>>>> > > > >>>>>> Now the whole system works in a way that keytab file (we can > > imagine > > > >>>>> that > > > >>>>>> as plaintext password) is reachable on all nodes. > > > >>>>>> This is a relatively huge attack surface. Now the main intention > > is: > > > >>>>>> * Instead of propagating keytab file to all nodes propagate a > TGS > > > >>> which > > > >>>>> has > > > >>>>>> limited lifetime (more secure) > > > >>>>>> * Do the TGS generation in a single place so KDC may not > collapse > > + > > > >>>>> having > > > >>>>>> keytab only on a single node can be better protected > > > >>>>>> > > > >>>>>> As a final conclusion if there is a place which expects to do > > > >> kerberos > > > >>>>>> authentication then it's a MUST to have either TGT or TGS. > > > >>>>>> Now it's done in a pretty unsecure way. The questions are the > > > >>> following: > > > >>>>>> * Do we want to leave this unsecure keytab propagation like this > > and > > > >>>>> bomb > > > >>>>>> KDC? > > > >>>>>> * If no then how do we propagate the more secure token to TMs. > > > >>>>>> > > > >>>>>> If the answer to the first question is no then the FLIP can be > > > >>> abandoned > > > >>>>>> and doesn't worth the further effort. > > > >>>>>> If the answer is yes then we can talk about the how part. > > > >>>>>> > > > >>>>>> G > > > >>>>>> > > > >>>>>> > > > >>>>>> On Thu, Feb 3, 2022 at 3:42 PM Till Rohrmann< > trohrm...@apache.org > > > > > > >>>>> wrote: > > > >>>>>>> I would prefer not choosing the first option > > > >>>>>>> > > > >>>>>>>> Make the TM accept tasks only after registration(not sure if > > it's > > > >>>>>>> possible or makes sense at all) > > > >>>>>>> > > > >>>>>>> because it effectively means that we change how Flink's > component > > > >>>>> lifecycle > > > >>>>>>> works for distributing Kerberos tokens. 
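[Editor's note: as an aside, the KDC-load argument made earlier in the thread (every JVM obtaining its own TGS vs. a single renewer) can be illustrated with a back-of-envelope model. All numbers below are purely illustrative assumptions, not measurements from any cluster.]

```java
/**
 * Back-of-envelope model of KDC request volume under the two approaches
 * discussed in this thread. Numbers are illustrative assumptions only.
 */
public class KdcLoadModel {
    public static void main(String[] args) {
        int taskManagers = 200;        // cluster size mentioned in the FLIP discussion
        int externalSystems = 4;       // e.g. HDFS, HBase, Hive, Kafka
        int connectionsPerSystem = 3;  // one task may open several connections

        // Approach 1: keytab/TGT on every node, each JVM requests its own TGS.
        int perNodeRequests = taskManagers * externalSystems * connectionsPerSystem;

        // Approach 2: only the JM obtains the TGS set and propagates it to all TMs.
        int centralizedRequests = externalSystems;

        System.out.println("per-node TGS requests: " + perNodeRequests);
        System.out.println("centralized TGS requests: " + centralizedRequests);
    }
}
```

Under these (assumed) numbers the per-node approach issues 2400 TGS requests per renewal cycle versus 4 for the centralized renewer, which is the scale gap the thread is arguing about.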
It also effectively > means > > > >>> that > > > >>>>> a TM > > > >>>>>>> cannot make progress until connected to a RM. > > > >>>>>>> > > > >>>>>>> I am not a Kerberos expert but is it really so that every > > > >> application > > > >>>>> that > > > >>>>>>> wants to use Kerberos needs to implement the token propagation > > > >>> itself? > > > >>>>> This > > > >>>>>>> somehow feels as if there is something missing. > > > >>>>>>> > > > >>>>>>> Cheers, > > > >>>>>>> Till > > > >>>>>>> > > > >>>>>>> On Thu, Feb 3, 2022 at 3:29 PM Gabor Somogyi < > > > >>>>> gabor.g.somo...@gmail.com> > > > >>>>>>> wrote: > > > >>>>>>> > > > >>>>>>>>> Isn't this something the underlying resource management > > system > > > >>>>> could > > > >>>>>>> do > > > >>>>>>>> or which every process could do on its own? > > > >>>>>>>> > > > >>>>>>>> I was looking for such feature but not found. > > > >>>>>>>> Maybe we can solve the propagation easier but then I'm waiting > > on > > > >>>>> better > > > >>>>>>>> suggestion. > > > >>>>>>>> If anybody has better/more simple idea then please point to a > > > >>> specific > > > >>>>>>>> feature which works on all resource management systems. > > > >>>>>>>> > > > >>>>>>>>> Here's an example for the TM to run workloads without being > > > >>> connected > > > >>>>>>>> to the RM, without ever having a valid token > > > >>>>>>>> > > > >>>>>>>> All in all I see the main problem. Not sure what is the reason > > > >>> behind > > > >>>>>>> that > > > >>>>>>>> a TM accepts tasks w/o registration but clearly not helping > > here. > > > >>>>>>>> I basically see 2 possible solutions: > > > >>>>>>>> * Make the TM accept tasks only after registration(not sure if > > > >> it's > > > >>>>>>>> possible or makes sense at all) > > > >>>>>>>> * We send tokens right after container creation with > > > >>>>>>>> "updateDelegationTokens" > > > >>>>>>>> Not sure which one is more realistic to do since I'm not > > involved > > > >>> the > > > >>>>> new > > > >>>>>>>> feature. 
> > > >>>>>>>> WDYT? > > > >>>>>>>> > > > >>>>>>>> > > > >>>>>>>> On Thu, Feb 3, 2022 at 3:09 PM Till Rohrmann < > > > >> trohrm...@apache.org> > > > >>>>>>> wrote: > > > >>>>>>>>> Hi everyone, > > > >>>>>>>>> > > > >>>>>>>>> Sorry for joining this discussion late. I also did not read > all > > > >>>>>>> responses > > > >>>>>>>>> in this thread so my question might already be answered: Why > > does > > > >>>>> Flink > > > >>>>>>>>> need to be involved in the propagation of the tokens? Why do > we > > > >>> need > > > >>>>>>>>> explicit RPC calls in the Flink domain? Isn't this something > > the > > > >>>>>>> underlying > > > >>>>>>>>> resource management system could do or which every process > > could > > > >> do > > > >>>>> on > > > >>>>>>> its > > > >>>>>>>>> own? I am a bit worried that we are making Flink responsible > > for > > > >>>>>>> something > > > >>>>>>>>> that it is not really designed to do so. > > > >>>>>>>>> > > > >>>>>>>>> Cheers, > > > >>>>>>>>> Till > > > >>>>>>>>> > > > >>>>>>>>> On Thu, Feb 3, 2022 at 2:54 PM Chesnay Schepler < > > > >>> ches...@apache.org> > > > >>>>>>>>> wrote: > > > >>>>>>>>> > > > >>>>>>>>>> Here's an example for the TM to run workloads without being > > > >>>>> connected > > > >>>>>>> to > > > >>>>>>>>>> the RM, while potentially having a valid token: > > > >>>>>>>>>> > > > >>>>>>>>>> 1. TM registers at RM > > > >>>>>>>>>> 2. JobMaster requests slot from RM -> TM gets notified > > > >>>>>>>>>> 3. JM fails over > > > >>>>>>>>>> 4. TM re-offers the slot to the failed over JobMaster > > > >>>>>>>>>> 5. TM reconnects to RM at some point > > > >>>>>>>>>> > > > >>>>>>>>>> Here's an example for the TM to run workloads without being > > > >>>>> connected > > > >>>>>>> to > > > >>>>>>>>>> the RM, without ever having a valid token: > > > >>>>>>>>>> > > > >>>>>>>>>> 1. TM1 has a valid token and is running some tasks. > > > >>>>>>>>>> 2. TM1 crashes > > > >>>>>>>>>> 3. 
TM2 is started to take over, and re-uses the working > > > >>> directory > > > >>>>> of > > > >>>>>>>>>> TM1 (new feature in 1.15!) > > > >>>>>>>>>> 4. TM2 recovers the previous slot allocations > > > >>>>>>>>>> 5. TM2 is informed about leading JM > > > >>>>>>>>>> 6. TM2 starts registration with RM > > > >>>>>>>>>> 7. TM2 offers slots to JobMaster > > > >>>>>>>>>> 8. TM2 accepts task submission from JobMaster > > > >>>>>>>>>> 9. ...some time later the registration completes... > > > >>>>>>>>>> > > > >>>>>>>>>> > > > >>>>>>>>>> On 03/02/2022 14:24, Gabor Somogyi wrote: > > > >>>>>>>>>>>> but it can happen that the JobMaster+TM collaborate to run > > > >> stuff > > > >>>>>>>>>>> without the TM being registered at the RM > > > >>>>>>>>>>> > > > >>>>>>>>>>> Honestly I'm not educated enough within Flink to give an > > > >> example > > > >>> to > > > >>>>>>>>>>> such scenario. > > > >>>>>>>>>>> Until now I thought JM defines tasks to be done and TM just > > > >>> blindly > > > >>>>>>>>>>> connects to external systems and does the processing. > > > >>>>>>>>>>> All in all if external systems can be touched when JM + TM > > > >>>>>>>>>>> collaboration happens then we need to consider that in the > > > >>> design. > > > >>>>>>>>>>> Since I don't have an example scenario I don't know what > > > >> exactly > > > >>>>>>> needs > > > >>>>>>>>>>> to be solved. > > > >>>>>>>>>>> I think we need an example case to decide whether we face a > > > >> real > > > >>>>>>> issue > > > >>>>>>>>>>> or the design is not leaking. > > > >>>>>>>>>>> > > > >>>>>>>>>>> > > > >>>>>>>>>>> On Thu, Feb 3, 2022 at 2:12 PM Chesnay Schepler < > > > >>>>> ches...@apache.org> > > > >>>>>>>>>>> wrote: > > > >>>>>>>>>>> > > > >>>>>>>>>>> > Just to learn something new. I think local recovery > > is > > > >>>>> clear to > > > >>>>>>>>>>> me which is not touching external systems like Kafka > or > > > so > > > >>>>>>>>>>> (correct me if I'm wrong). 
Is it possible that such > > case > > > >> the > > > >>>>> user > > > >>>>>>>>>>> code just starts to run blindly w/o JM coordination > and > > > >>>>> connects > > > >>>>>>>>>>> to external systems to do data processing? > > > >>>>>>>>>>> > > > >>>>>>>>>>> Local recovery itself shouldn't touch external > systems; > > > >> the > > > >>> TM > > > >>>>>>>>>>> cannot just run user-code without the JobMaster being > > > >>>>> involved, > > > >>>>>>>>>>> but it can happen that the JobMaster+TM collaborate > to > > > run > > > >>>>> stuff > > > >>>>>>>>>>> without the TM being registered at the RM. > > > >>>>>>>>>>> > > > >>>>>>>>>>> On 03/02/2022 13:48, Gabor Somogyi wrote: > > > >>>>>>>>>>>> > Any error in loading the provider (be it by > accident > > > or > > > >>>>>>>>>>>> explicit checks) then is a setup error and we can > fail > > > >> the > > > >>>>>>>>>> cluster. > > > >>>>>>>>>>>> Fail fast is a good direction in my view. In Spark I > > > >> wanted > > > >>>>> to > > > >>>>>>> go > > > >>>>>>>>>>>> to this direction but there were other opinions so > > there > > > >>> if a > > > >>>>>>>>>>>> provider is not loaded then the workload goes > further. > > > >>>>>>>>>>>> Of course the processing will fail if the token is > > > >>> missing... > > > >>>>>>>>>>>> > Requiring HBase (and Hadoop for that matter) to be > > on > > > >> the > > > >>>>> JM > > > >>>>>>>>>>>> system classpath would be a bit unfortunate. Have > you > > > >>>>> considered > > > >>>>>>>>>>>> loading the providers as plugins? > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Even if it's unfortunate the actual implementation > is > > > >>>>> depending > > > >>>>>>>>>>>> on that already. Moving HBase and/or all token > > providers > > > >>> into > > > >>>>>>>>>>>> plugins is a possibility. > > > >>>>>>>>>>>> That way if one wants to use a specific provider > then > > a > > > >>>>> plugin > > > >>>>>>>>>>>> need to be added. 
If we would like to go to this > > > >> direction > > > >>> I > > > >>>>>>>>>>>> would do that in a separate > > > >>>>>>>>>>>> FLIP not to have feature creep here. The actual FLIP > > > >>> already > > > >>>>>>>>>>>> covers several thousand lines of code changes. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > This is missing from the FLIP. From my experience > > with > > > >>> the > > > >>>>>>>>>>>> metric reporters, having the implementation rely on > > the > > > >>>>>>>>>>>> configuration is really annoying for testing > purposes. > > > >>> That's > > > >>>>>>> why > > > >>>>>>>>>>>> I suggested factories; they can take care of > > extracting > > > >> all > > > >>>>>>>>>>>> parameters that the implementation needs, and then > > pass > > > >>> them > > > >>>>>>>>>>>> nicely via the constructor. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> ServiceLoader provided services must have a norarg > > > >>>>> constructor > > > >>>>>>>>>>>> where no parameters can be passed. > > > >>>>>>>>>>>> As a side note testing delegation token providers is > > > pain > > > >>> in > > > >>>>> the > > > >>>>>>>>>>>> ass and not possible with automated tests without > > > >> creating > > > >>> a > > > >>>>>>>>>>>> fully featured kerberos cluster with KDC, HDFS, > HBase, > > > >>> Kafka, > > > >>>>>>>>>> etc.. > > > >>>>>>>>>>>> We've had several tries in Spark but then gave it up > > > >>> because > > > >>>>> of > > > >>>>>>>>>>>> the complexity and the flakyness of it so I wouldn't > > > care > > > >>>>> much > > > >>>>>>>>>>>> about unit testing. > > > >>>>>>>>>>>> The sad truth is that most of the token providers > can > > be > > > >>>>> tested > > > >>>>>>>>>>>> manually on cluster. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Of course this doesn't mean that the whole code is > not > > > >>>>> intended > > > >>>>>>>>>>>> to be covered with tests. I mean couple of parts can > > be > > > >>>>>>>>>>>> automatically tested but providers are not such. 
> > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > This also implies that any fields of the provider > > > >>> wouldn't > > > >>>>>>>>>>>> inherently have to be mutable. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> I think this is not an issue. A provider connects > to a > > > >>>>> service, > > > >>>>>>>>>>>> obtains token(s) and then close the connection and > > never > > > >>> seen > > > >>>>>>> the > > > >>>>>>>>>>>> need of an intermediate state. > > > >>>>>>>>>>>> I've just mentioned the singleton behavior to be > > clear. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > One examples is a TM restart + local recovery, > where > > > >> the > > > >>> TM > > > >>>>>>>>>>>> eagerly offers the previous set of slots to the > > leading > > > >> JM. > > > >>>>>>>>>>>> Just to learn something new. I think local recovery > is > > > >>> clear > > > >>>>> to > > > >>>>>>>>>>>> me which is not touching external systems like Kafka > > or > > > >> so > > > >>>>>>>>>>>> (correct me if I'm wrong). > > > >>>>>>>>>>>> Is it possible that such case the user code just > > starts > > > >> to > > > >>>>> run > > > >>>>>>>>>>>> blindly w/o JM coordination and connects to external > > > >>> systems > > > >>>>> to > > > >>>>>>>>>>>> do data processing? > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> On Thu, Feb 3, 2022 at 1:09 PM Chesnay Schepler > > > >>>>>>>>>>>> <ches...@apache.org> wrote: > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> 1) > > > >>>>>>>>>>>> The manager certainly shouldn't check for > specific > > > >>>>>>>>>>>> implementations. > > > >>>>>>>>>>>> The problem with classpath-based checks is it > can > > > >>> easily > > > >>>>>>>>>>>> happen that the provider can't be loaded in the > > > first > > > >>>>> place > > > >>>>>>>>>>>> (e.g., if you don't use reflection, which you > > > >> currently > > > >>>>>>> kinda > > > >>>>>>>>>>>> force), and in that case Flink can't tell > whether > > > the > > > >>>>> token > > > >>>>>>>>>>>> is not required or the cluster isn't set up > > > >> correctly. 
> > > >>>>>>>>>>>> As I see it we shouldn't try to be clever; if > the > > > >> users > > > >>>>>>> wants > > > >>>>>>>>>>>> kerberos, then have him enable the providers. > Any > > > >> error > > > >>>>> in > > > >>>>>>>>>>>> loading the provider (be it by accident or > > explicit > > > >>>>> checks) > > > >>>>>>>>>>>> then is a setup error and we can fail the > cluster. > > > >>>>>>>>>>>> If we still want to auto-detect whether the > > provider > > > >>>>> should > > > >>>>>>>>>>>> be used, note that using factories would make > this > > > >>>>> easier; > > > >>>>>>>>>>>> the factory can check the classpath (not having > > any > > > >>>>> direct > > > >>>>>>>>>>>> dependencies on HBase avoids the case above), > and > > > the > > > >>>>>>>>>>>> provider no longer needs reflection because it > > will > > > >>> only > > > >>>>> be > > > >>>>>>>>>>>> used iff HBase is on the CP. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> Requiring HBase (and Hadoop for that matter) to > be > > > on > > > >>>>> the JM > > > >>>>>>>>>>>> system classpath would be a bit unfortunate. > Have > > > you > > > >>>>>>>>>>>> considered loading the providers as plugins? > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> 2) > DelegationTokenProvider#init method > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> This is missing from the FLIP. From my > experience > > > >> with > > > >>>>> the > > > >>>>>>>>>>>> metric reporters, having the implementation rely > > on > > > >> the > > > >>>>>>>>>>>> configuration is really annoying for testing > > > >> purposes. > > > >>>>>>> That's > > > >>>>>>>>>>>> why I suggested factories; they can take care of > > > >>>>> extracting > > > >>>>>>>>>>>> all parameters that the implementation needs, > and > > > >> then > > > >>>>> pass > > > >>>>>>>>>>>> them nicely via the constructor. This also > implies > > > >> that > > > >>>>> any > > > >>>>>>>>>>>> fields of the provider wouldn't inherently have > to > > > be > > > >>>>>>> mutable. 
> > > >>>>>>>>>>>> > workloads are not yet running until the > initial > > > >> token > > > >>>>> set > > > >>>>>>>>>>>> is not propagated. > > > >>>>>>>>>>>> > > > >>>>>>>>>>>> This isn't necessarily true. It can happen that > > > tasks > > > >>> are > > > >>>>>>>>>>>> being deployed to the TM without it having > > > registered > > > >>>>> with > > > >>>>>>>>>>>> the RM; there is currently no requirement that a > > TM > > > >>> must > > > >>>>> be > > > >>>>>>>>>>>> registered before it may offer slots / accept > task > > > >>>>>>>>>> submissions. > > > >>>>>>>>>>>> One examples is a TM restart + local recovery, > > where > > > >>> the > > > >>>>> TM > > > >>>>>>>>>>>> eagerly offers the previous set of slots to the > > > >> leading > > > >>>>> JM. > > > >>>>>>>>>>>> On 03/02/2022 12:39, Gabor Somogyi wrote: > > > >>>>>>>>>>>>> Thanks for the quick response! > > > >>>>>>>>>>>>> Appreciate your invested time... > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> G > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> On Thu, Feb 3, 2022 at 11:12 AM Chesnay > Schepler > > > >>>>>>>>>>>>> <ches...@apache.org> wrote: > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> Thanks for answering the questions! > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> 1) Does the HBase provider require HBase to > > be > > > >> on > > > >>>>> the > > > >>>>>>>>>>>>> classpath? > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> To be instantiated no, to obtain a token yes. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> If so, then could it even be loaded if > > > Hbase > > > >>> is > > > >>>>> on > > > >>>>>>>>>>>>> the classpath? > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> The provider can be loaded but inside the > > provider > > > >> it > > > >>>>> would > > > >>>>>>>>>>>>> detect whether HBase is on classpath. > > > >>>>>>>>>>>>> Just to be crystal clear here this is the > actual > > > >>>>>>>>>>>>> implementation what I would like to take over > > into > > > >> the > > > >>>>>>>>>> Provider. 
> > > >>>>>>>>>>>>> Please see: > > > >>>>>>>>>>>>> > > > >> > > > > > > https://github.com/apache/flink/blob/e6210d40491ff28c779b8604e425f01983f8a3d7/flink-yarn/src/main/java/org/apache/flink/yarn/Utils.java#L243-L254 > > > >>>>>>>>>>>>> I've considered to load only the necessary > > > Providers > > > >>> but > > > >>>>>>>>>>>>> that would mean a generic Manager need to know > > that > > > >> if > > > >>>>> the > > > >>>>>>>>>>>>> newly loaded Provider is > > > >>>>>>>>>>>>> instanceof HBaseDelegationTokenProvider, then > it > > > >> need > > > >>>>> to be > > > >>>>>>>>>>>>> skipped. > > > >>>>>>>>>>>>> I think it would add unnecessary complexity to > > the > > > >>>>> Manager > > > >>>>>>>>>>>>> and it would contain ugly code parts(at least > in > > my > > > >>> view > > > >>>>>>>>>>>>> ugly), like this > > > >>>>>>>>>>>>> if (provider instanceof > > > HBaseDelegationTokenProvider > > > >>> && > > > >>>>>>>>>>>>> hbaseIsNotOnClasspath()) { > > > >>>>>>>>>>>>> // Skip intentionally > > > >>>>>>>>>>>>> } else if (provider instanceof > > > >>>>>>>>>>>>> SomethingElseDelegationTokenProvider && > > > >>>>>>>>>>>>> somethingElseIsNotOnClasspath()) { > > > >>>>>>>>>>>>> // Skip intentionally > > > >>>>>>>>>>>>> } else { > > > >>>>>>>>>>>>> providers.put(provider.serviceName(), > > provider); > > > >>>>>>>>>>>>> } > > > >>>>>>>>>>>>> I think the least code and most clear approach > is > > > to > > > >>>>> load > > > >>>>>>>>>>>>> the providers and decide inside whether > > everything > > > >> is > > > >>>>> given > > > >>>>>>>>>>>>> to obtain a token. > > > >>>>>>>>>>>>> > > > >>>>>>>>>>>>> If not, then you're assuming the > > classpath > > > >> of > > > >>>>> the > > > >>>>>>>>>>>>> JM/TM to be the same, which isn't > necessarily > > > >> true > > > >>>>> (in > > > >>>>>>>>>>>>> general; and also if Hbase is loaded from > the > > > >>>>>>> user-jar). > > > >>>>>>>>>>>>> I'm not assuming that the classpath of JM/TM > must > > > be > > > >>> the > > > >>>>>>>>>>>>> same. 
If the HBase jar is coming from the user-jar, then the HBase code is going
to use UGI within the JVM when authentication is required.
Of course I've not yet tested this within Flink, but in Spark it works fine.
All in all, the JM/TM classpaths may be different, but on both sides the
HBase jar must exist somehow.

> 2) None of the Providers in your PoC get access to the configuration.
> Only the Manager is. Note that I do not know whether there is a need for
> the providers to have access to the config, as that's very
> implementation specific I suppose.

You're right. Since this is just a POC and I don't have a green light,
I've not put too much effort into a proper self-review. The
DelegationTokenProvider#init method must get the Flink configuration.
The reason is that several further configurations can be derived from it;
a good example is obtaining the Hadoop conf. The rationale is the same as
before: it would be good to keep the Manager as generic as possible.
To be more specific, some code must load the Hadoop conf, which could be
either the Manager or the Provider.
If the Manager does that, then the generic Manager must be modified every
time something special is needed for a new provider.
This could be super problematic when a custom provider is written.

> 10) I'm not sure myself. It could be something as trivial as creating
> some temporary directory in HDFS, I suppose.

I've not found such a task. YARN and K8s are not expecting such things
from executors, and workloads are not running until the initial token set
has been propagated.

On 03/02/2022 10:23, Gabor Somogyi wrote:

Please see my answers inline. Hope I provided satisfying answers to all
questions.

G

On Thu, Feb 3, 2022 at 9:17 AM Chesnay Schepler <ches...@apache.org> wrote:

> I have a few questions that I'd appreciate if you could answer.
>
> 1. How does the Provider know whether it is required or not?

All registered providers which are registered properly are going to be
loaded and asked to obtain tokens.
Worth mentioning that every provider has the right to decide whether it
wants to obtain tokens or not (boolean delegationTokensRequired()). For
instance, if a provider detects that HBase is not on the classpath or not
configured properly, then no tokens are obtained from that specific
provider.

You may ask how a provider is registered. Here it is: the provider is on
the classpath, plus there is a META-INF file which contains the name of
the provider, for example:
META-INF/services/org.apache.flink.runtime.security.token.DelegationTokenProvider
https://github.com/apache/flink/compare/master...gaborgsomogyi:dt?expand=1#diff-b65ee7e64c5d2dfbb683d3569fc3e42f4b5a8052ab83d7ac21de5ab72f428e0b

> 2. How does the configuration of Providers work (how do they get access
> to a configuration)?

The Flink configuration is going to be passed to all providers. Please see
the POC here:
https://github.com/apache/flink/compare/master...gaborgsomogyi:dt?expand=1
Service-specific configurations are loaded on-the-fly.
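To make the "load everything, let each provider decide" idea above more
concrete, here is a minimal sketch. The interface is a hypothetical,
trimmed-down version modeled on the linked POC branch (not the final API),
and the providers are passed in explicitly so the sketch stays
self-contained; in the POC they would come from
ServiceLoader.load(DelegationTokenProvider.class) via the META-INF entry.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, minimal version of the POC's provider interface.
interface DelegationTokenProvider {
    String name();
    // Each provider decides for itself whether tokens should be obtained,
    // e.g. based on whether its service is on the classpath and configured.
    boolean delegationTokensRequired();
}

public class ProviderLoading {
    // No instanceof checks in the Manager: the provider itself says
    // whether it can and should obtain tokens.
    static Map<String, DelegationTokenProvider> loadProviders(
            Iterable<DelegationTokenProvider> discovered) {
        Map<String, DelegationTokenProvider> providers = new HashMap<>();
        for (DelegationTokenProvider p : discovered) {
            if (p.delegationTokensRequired()) {
                providers.put(p.name(), p);
            }
        }
        return providers;
    }

    public static void main(String[] args) {
        DelegationTokenProvider hbase = new DelegationTokenProvider() {
            public String name() { return "hbase"; }
            // e.g. HBase is not on the classpath, so no tokens wanted
            public boolean delegationTokensRequired() { return false; }
        };
        DelegationTokenProvider hadoopfs = new DelegationTokenProvider() {
            public String name() { return "hadoopfs"; }
            public boolean delegationTokensRequired() { return true; }
        };
        Map<String, DelegationTokenProvider> active =
                loadProviders(Arrays.asList(hbase, hadoopfs));
        System.out.println(active.keySet()); // [hadoopfs]
    }
}
```

This keeps the Manager free of per-service knowledge: adding a custom
provider needs only a new implementation plus a META-INF/services entry,
with no Manager change.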
> > > >>>>>>>>>> For example in HBase > > > >>>>>>>>>>>>>> case it looks for HBase configuration > class > > > >> which > > > >>>>> will > > > >>>>>>>>>> be instantiated > > > >>>>>>>>>>>>>> within the provider. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> 1. How does a user select providers? > > (Is > > > >> it > > > >>>>>>> purely > > > >>>>>>>>>> based on the > > > >>>>>>>>>>>>>>> provider being on the classpath?) > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Providers can be explicitly turned off > with > > > >> the > > > >>>>>>>>>> following config: > > > >>>>>>>>>>>>>> > "security.kerberos.tokens.${name}.enabled". > > > >> I've > > > >>>>> never > > > >>>>>>>>>> seen that 2 > > > >>>>>>>>>>>>>> different implementation would exist for a > > > >>> specific > > > >>>>>>>>>>>>>> external service, but if this edge case > > would > > > >>> exist > > > >>>>>>>>>> then the mentioned > > > >>>>>>>>>>>>>> config need to be added, a new provider > > with a > > > >>>>>>>>>> different name need to be > > > >>>>>>>>>>>>>> implemented and registered. > > > >>>>>>>>>>>>>> All in all we've seen that provider > handling > > > is > > > >>> not > > > >>>>>>>>>> user specific task but > > > >>>>>>>>>>>>>> a cluster admin one. If a specific > provider > > is > > > >>>>> needed > > > >>>>>>>>>> then it's implemented > > > >>>>>>>>>>>>>> once per company, registered once > > > >>>>>>>>>>>>>> to the clusters and then all users may or > > may > > > >> not > > > >>>>> use > > > >>>>>>>>>> the obtained tokens. > > > >>>>>>>>>>>>>> Worth to mention the system will know > which > > > >> token > > > >>>>> need > > > >>>>>>>>>> to be used when HDFS > > > >>>>>>>>>>>>>> is accessed, this part is automatic. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> 1. How can a user override an > existing > > > >>>>> provider? > > > >>>>>>>>>>>>>>> Pease see the previous bulletpoint. > > > >>>>>>>>>>>>>>> 1. What is > > DelegationTokenProvider#name() > > > >>> used > > > >>>>>>> for? 
> > > >>>>>>>>>>>>>>> By default all providers which are > > registered > > > >>>>>>> properly > > > >>>>>>>>>> (on classpath + > > > >>>>>>>>>>>>>> META-INF entry) are on by default. With > > > >>>>>>>>>>>>>> > "security.kerberos.tokens.${name}.enabled" a > > > >>>>> specific > > > >>>>>>>>>> provider can be > > > >>>>>>>>>>>>>> turned off. > > > >>>>>>>>>>>>>> Additionally I'm intended to use this in > log > > > >>>>> entries > > > >>>>>>>>>> later on for debugging > > > >>>>>>>>>>>>>> purposes. For example "hadoopfs provider > > > >>> obtained 2 > > > >>>>>>>>>> tokens with ID...". > > > >>>>>>>>>>>>>> This would help what and when is happening > > > >>>>>>>>>>>>>> with tokens. The same applies to > TaskManager > > > >>> side: > > > >>>>> "2 > > > >>>>>>>>>> hadoopfs provider > > > >>>>>>>>>>>>>> tokens arrived with ID...". Important to > > note > > > >>> that > > > >>>>> the > > > >>>>>>>>>> secret part will be > > > >>>>>>>>>>>>>> hidden in the mentioned log entries to > keep > > > the > > > >>>>>>>>>>>>>> attach surface low. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> 1. What happens if the names of 2 > > > >> providers > > > >>>>> are > > > >>>>>>>>>> identical? > > > >>>>>>>>>>>>>>> Presume you mean 2 different classes > which > > > >> both > > > >>>>>>>>>> registered and having the > > > >>>>>>>>>>>>>> same logic inside. This case both will be > > > >> loaded > > > >>>>> and > > > >>>>>>>>>> both is going to > > > >>>>>>>>>>>>>> obtain token(s) for the same service. > > > >>>>>>>>>>>>>> Both obtained token(s) are going to be > added > > > to > > > >>> the > > > >>>>>>>>>> UGI. As a result the > > > >>>>>>>>>>>>>> second will overwrite the first but the > > order > > > >> is > > > >>>>> not > > > >>>>>>>>>> defined. Since both > > > >>>>>>>>>>>>>> token(s) are valid no matter which one is > > > >>>>>>>>>>>>>> used then access to the external system > will > > > >>> work. 
> > > >>>>>>>>>>>>>> When the class names are same then service > > > >> loader > > > >>>>> only > > > >>>>>>>>>> loads a single entry > > > >>>>>>>>>>>>>> because services are singletons. That's > the > > > >>> reason > > > >>>>> why > > > >>>>>>>>>> state inside > > > >>>>>>>>>>>>>> providers are not advised. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> 1. Will we directly load the > provider, > > or > > > >>>>> first > > > >>>>>>>>>> load a factory > > > >>>>>>>>>>>>>>> (usually preferable)? > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Intended to load a provider directly by > > DTM. > > > >> We > > > >>>>> can > > > >>>>>>>>>> add an extra layer to > > > >>>>>>>>>>>>>> have factory but after consideration I > came > > to > > > >> a > > > >>>>>>>>>> conclusion that it would > > > >>>>>>>>>>>>>> be and overkill this case. > > > >>>>>>>>>>>>>> Please have a look how it's planned to > load > > > >>>>> providers > > > >>>>>>>>>> now: > > > >> > > > > > > https://github.com/apache/flink/compare/master...gaborgsomogyi:dt?expand=1#diff-d56a0bc77335ff23c0318f6dec1872e7b19b1a9ef6d10fff8fbaab9aecac94faR54-R81 > > > >>>>>>>>>>>>>>> 1. What is the Credentials class (it > > > would > > > >>>>>>>>>> necessarily have to be a > > > >>>>>>>>>>>>>>> public api as well)? > > > >>>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> Credentials class is coming from Hadoop. > My > > > >> main > > > >>>>>>>>>> intention was not to bind > > > >>>>>>>>>>>>>> the implementation to Hadoop completely. > It > > is > > > >>> not > > > >>>>>>>>>> possible because of the > > > >>>>>>>>>>>>>> following reasons: > > > >>>>>>>>>>>>>> * Several functionalities are must because > > > >> there > > > >>>>> are > > > >>>>>>> no > > > >>>>>>>>>> alternatives, > > > >>>>>>>>>>>>>> including but not limited to login from > > > keytab, > > > >>>>> proper > > > >>>>>>>>>> TGT cache handling, > > > >>>>>>>>>>>>>> passing tokens to Hadoop services like > HDFS, > > > >>> HBase, > > > >>>>>>>>>> Hive, etc. 
> > > >>>>>>>>>>>>>> * The partial win is that the whole > > delegation > > > >>>>> token > > > >>>>>>>>>> framework is going to > > > >>>>>>>>>>>>>> be initiated if hadoop-common is on > > classpath > > > >>>>> (Hadoop > > > >>>>>>>>>> is optional in core > > > >>>>>>>>>>>>>> libraries) > > > >>>>>>>>>>>>>> The possibility to eliminate Credentials > > from > > > >> API > > > >>>>>>> could > > > >>>>>>>>>> be: > > > >>>>>>>>>>>>>> * to convert Credentials to byte array > forth > > > >> and > > > >>>>> back > > > >>>>>>>>>> while a provider > > > >>>>>>>>>>>>>> gives back token(s): I think this would be > > an > > > >>>>> overkill > > > >>>>>>>>>> and would make the > > > >>>>>>>>>>>>>> API less clear what to give back what > > Manager > > > >>>>>>>>>> understands > > > >>>>>>>>>>>>>> * to re-implement Credentials internal > > > >> structure > > > >>>>> in a > > > >>>>>>>>>> POJO, here the same > > > >>>>>>>>>>>>>> convert forth and back would happen > between > > > >>>>> provider > > > >>>>>>>>>> and manager. I think > > > >>>>>>>>>>>>>> this case would be the re-invent the wheel > > > >>> scenario > > > >>>>>>>>>>>>>>> 1. What does the TaskManager do with > > the > > > >>>>> received > > > >>>>>>>>>> token? > > > >>>>>>>>>>>>>>> Puts the tokens into the > > UserGroupInformation > > > >>>>>>> instance > > > >>>>>>>>>> for the current > > > >>>>>>>>>>>>>> user. Such way Hadoop compatible services > > can > > > >>> pick > > > >>>>> up > > > >>>>>>>>>> the tokens from there > > > >>>>>>>>>>>>>> properly. > > > >>>>>>>>>>>>>> This is an existing pattern inside Spark. > > > >>>>>>>>>>>>>> > > > >>>>>>>>>>>>>>> 1. Is there any functionality in the > > > >>>>> TaskManager > > > >>>>>>>>>> that could require a > > > >>>>>>>>>>>>>>> token on startup (i.e., before > > > registering > > > >>>>> with > > > >>>>>>>>>> the RM)? > > > >>>>>>>>>>>>>>> Never seen such functionality in Spark > and > > > >> after > > > >>>>>>>>>> analysis not seen in > > > >>>>>>>>>>>>>> Flink too. 
If you have something in mind which I've missed, please help me out.

On 11/01/2022 14:58, Gabor Somogyi wrote:

Hi All,

Hope all of you have enjoyed the holiday season.
I would like to start the discussion on FLIP-211
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-211%3A+Kerberos+delegation+token+framework>,
which aims to provide a Kerberos delegation token framework that
obtains/renews/distributes tokens out-of-the-box.

Please be aware that the FLIP wiki area is not fully done, since the
discussion may change the feature in major ways.
The proposal can be found in a Google doc here:
<https://docs.google.com/document/d/1JzMbQ1pCJsLVz8yHrCxroYMRP2GwGwvacLrGyaIx5Yc/edit?fbclid=IwAR0vfeJvAbEUSzHQAAJfnWTaX46L6o7LyXhMfBUCcPrNi-uXNgoOaI8PMDQ>.
As the community agrees on the approach, the content will be moved to the
wiki page.

Feel free to add your thoughts to make this feature better!

BR,
G