Sorry, I didn't want to offend anybody if it came across like that. I can
see that joining the discussion very late w/o constructive ideas was not
nice. My motivation for asking about the reasoning behind the current
design proposal is primarily my own lack of Kerberos knowledge. Moreover,
it has happened before that we moved responsibilities into Flink that we
regretted later.

As I've said, I don't have a better idea right now. If we believe that it
is the right thing to make Flink responsible for distributing the tokens
and we don't find a better solution then we'll go for it. I just wanted to
make sure that we don't overlook an alternative solution that might be
easier to maintain in the long run.

Cheers,
Till

On Thu, Feb 3, 2022 at 7:52 PM Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi Team!
>
> Let's all calm down a little and not let our emotions affect the discussion
> too much.
> There has been a lot of effort spent by all involved parties, so this is
> quite understandable :)
>
> Even though not everyone said this explicitly, it seems that everyone more
> or less agrees that a feature implementing token renewal is necessary and
> valuable.
>
> The main point of contention is: where should the token renewal logic
> run, and how should the tokens get to wherever they are needed?
>
> From my perspective the current design is very reasonable at first sight
> because:
>  1. It runs the token renewal in a single place, avoiding extra KDC workload
>  2. It does not introduce new processes, extra communication channels, etc.,
> but piggybacks on existing robust mechanisms.
>
> I understand the concerns about adding new things in the resource manager
> but I think that really depends on how we look at it.
> We cannot reasonably expect a custom token renewal process to have its own
> secure distribution logic like Flink has now; that would be complete
> overkill. In practice this means we would not get a slim, efficient
> implementation but something unnecessarily complex. And the only thing we
> would get in return is a bit less code in the resource manager.
>
> From a logical standpoint the delegation token framework needs to run in a
> centralized place and needs to be able to reach new task manager processes
> to achieve all its design goals.
> We could drop the single renewer as a design goal, but that is a decision
> that could affect large-scale production runs.
>
> Cheers,
> Gyula
>
>
>
>
> On Thu, Feb 3, 2022 at 7:32 PM Chesnay Schepler <ches...@apache.org>
> wrote:
>
> > First of, at no point have we questioned the use-case and importance of
> > this feature, and the fact that David, Till and me spent time looking at
> > the FLIP, asking questions, and discussing different aspects of it
> > should make this obvious.
> >
> > I'd appreciate it if you didn't dismiss our replies that quickly.
> >
> >  > Ok, so we declare that delegation token support in Flink is dead-end
> > code and not supported, right?
> >
> > No one has said that. Are you claiming that your design is the /only
> > possible implementation/ that is capable of achieving the stated goals,
> > that there are 0 alternatives? One of the *main points* of these
> > discussion threads is to discover alternative implementations that maybe
> > weren't thought of. Yes, that may mean that we amend your design, or
> > reject it completely and come up with a new one.
> >
> >
> > Let's clarify what (I think) Till proposed to get the imagination juice
> > flowing.
> >
> > At the end of the day, all we need is a way to provide Flink processes
> > with a token that can be periodically updated. _Who_ issues that token
> > is irrelevant for the functionality to work. You are proposing for a new
> > component in the Flink RM to do that; Till is proposing to have some
> > external process do it. *That's it*.
> >
> > How this could look in practice is fairly straightforward; add a
> > pluggable interface (aka, your TokenProvider thing) that is loaded in
> > each process, which can _somehow_ provide tokens that are then set in
> > the UserGroupInformation.
> > _How_ the provider receives tokens is up to the provider. It _may_ just
> > talk directly to Kerberos, or it could use some communication channel to
> > accept tokens from the outside.
> > This would for example make it a lot easier to properly integrate this
> > into the lifecycle of the process, as we'd sidestep the whole "TM is
> > running but still needs a Token" issue; it could become a proper setup
> > step of the process that is independent from other Flink processes.
> >
> > /Discuss/.
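As an illustration of the pluggable interface sketched in the reply above, here is a minimal Java sketch. All names here (`TokenProvider`, `obtainToken`, `StaticTokenProvider`) are hypothetical, not actual Flink API; the point is only that the process asks a provider for a token during setup and does not care where the token comes from.

```java
import java.util.Optional;

// Hypothetical sketch of a pluggable token provider. Each process would
// invoke its provider during setup and install the returned token; how the
// provider gets the token (KDC, external service, mounted secret) is the
// provider's own business.
interface TokenProvider {
    Optional<byte[]> obtainToken();
}

// Stand-in provider: a constant replaces "talk directly to Kerberos" or
// "accept a token pushed from the outside".
class StaticTokenProvider implements TokenProvider {
    public Optional<byte[]> obtainToken() {
        return Optional.of("demo-token".getBytes());
    }
}

public class TokenProviderDemo {
    public static void main(String[] args) {
        TokenProvider provider = new StaticTokenProvider();
        // Setup step: refuse to proceed without an initial token.
        byte[] token = provider.obtainToken().orElseThrow();
        System.out.println(new String(token));
    }
}
```

With such an interface, swapping a KDC-talking provider for an externally fed one is a configuration change, not a redesign.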
> >
> > On 03/02/2022 18:57, Gabor Somogyi wrote:
> > >> And even
> > > if we do it like this, there is no guarantee that it works because
> there
> > > can be other applications bombing the KDC with requests.
> > >
> > > 1. The main issue to solve here is that workloads using delegation
> tokens
> > > are stopping after 7 days with default configuration.
> > > 2. This is not a new design; it has been rock stable and performing
> > > well in Spark for years.
> > >
> > >>  From a
> > > maintainability and separation of concerns perspective I'd rather have
> > this
> > > as some kind of external tool/service that makes KDC scale better and
> > that
> > > Flink processes can talk to to obtain the tokens.
> > >
> > > Ok, so we declare that delegation token support in Flink is dead-end
> > > code and not supported, right? Then this must be explicitly written in
> > > the security documentation, so that users who rely on that feature know
> > > they are left behind.
> > >
> > > As I see it, the discussion has turned away from facts and started to
> > > be about feelings. If you have strategic problems with the feature,
> > > please put your -1 on the vote and we can save quite some time.
> > >
> > > G
> > >
> > >
> > > On Thu, 3 Feb 2022, 18:34 Till Rohrmann,<trohrm...@apache.org>  wrote:
> > >
> > >> I don't have a good alternative solution but it sounds to me a bit as
> > if we
> > >> are trying to solve Kerberos' scalability problems within Flink. And
> > even
> > >> if we do it like this, there is no guarantee that it works because
> there
> > >> can be other applications bombing the KDC with requests. From a
> > >> maintainability and separation of concerns perspective I'd rather have
> > this
> > >> as some kind of external tool/service that makes KDC scale better and
> > that
> > >> Flink processes can talk to to obtain the tokens.
> > >>
> > >> Cheers,
> > >> Till
> > >>
> > >> On Thu, Feb 3, 2022 at 6:01 PM Gabor Somogyi<
> gabor.g.somo...@gmail.com>
> > >> wrote:
> > >>
> > >>> Oh, and I forgot the most important reason.
> > >>> Without the feature in the FLIP, all secure workloads using delegation
> > >>> tokens are going to stop when the tokens reach their max lifetime 🙂
> > >>> This is around 7 days with the default config...
> > >>>
> > >>> On Thu, Feb 3, 2022 at 5:30 PM Gabor Somogyi<
> gabor.g.somo...@gmail.com
> > >
> > >>> wrote:
> > >>>
> > >>>> That's not the sole purpose of the feature, but in some environments
> > >>>> it has caused problems.
> > >>>> The main intention is to avoid deploying the keytab to all the nodes,
> > >>>> because that enlarges the attack surface, and to reduce the KDC load.
> > >>>> I've already described the situation previously in this thread so
> > >> copying
> > >>>> it here.
> > >>>>
> > >>>> --------COPY--------
> > >>>> "KDC *may* collapse under some circumstances" is the proper wording.
> > >>>>
> > >>>> We have several customers who are executing workloads on
> > >>>> Spark/Flink. Most of the time I'm facing their daily issues, which
> > >>>> are heavily environment- and use-case-dependent. I've seen various
> > >>>> cases:
> > >>>> * where the mentioned ~1k nodes were working fine
> > >>>> * where the KDC thought the number of requests was coming from a DDOS
> > >>>> attack and so discontinued authentication
> > >>>> * where the KDC was simply not responding because of the load
> > >>>> * where the KDC intermittently had outages (this was the nastiest
> > >>>> thing)
> > >>>>
> > >>>> Since you're managing a relatively big cluster, you know that the
> > >>>> KDC is not only used by Spark/Flink workloads; the whole company IT
> > >>>> infrastructure is bombing it, so whether the KDC reaches its limit
> > >>>> depends on other factors too. I'm not sure what kind of evidence you
> > >>>> are looking for, but I'm not authorized to share any information
> > >>>> about our clients' data.
> > >>>>
> > >>>> One thing is for sure: the more external system types a workload
> > >>>> uses that authenticate through the KDC (e.g. HDFS, HBase, Hive,
> > >>>> Kafka), the more likely it is to reach this threshold when the
> > >>>> cluster is big enough.
> > >>>> --------COPY--------
> > >>>>
> > >>>>> The FLIP mentions scaling issues with 200 nodes; it's really
> > >> surprising
> > >>>> to me that such a small number of requests can already cause issues.
> > >>>>
> > >>>> One node/task doesn't mean one request. I've seen the following
> > >>>> Kerberos auth types running at the same time: HDFS, HBase, Hive,
> > >>>> Kafka, all DBs (Oracle, MariaDB, etc.). Additionally, one task does
> > >>>> not necessarily open just one connection.
> > >>>>
> > >>>> All in all I don't have steps to reproduce but we've faced this
> > >>> already...
> > >>>> G
> > >>>>
> > >>>>
> > >>>> On Thu, Feb 3, 2022 at 5:15 PM Chesnay Schepler<ches...@apache.org>
> > >>>> wrote:
> > >>>>
> > >>>>> What I don't understand is how this could overload the KDC. Aren't
> > >>>>> tokens valid for a relatively long time period?
> > >>>>>
> > >>>>> For new deployments where many TMs are started at once I could
> > imagine
> > >>>>> it temporarily, but shouldn't the accesses to the KDC eventually
> > >>>>> naturally spread out?
> > >>>>>
> > >>>>> The FLIP mentions scaling issues with 200 nodes; it's really
> > >> surprising
> > >>>>> to me that such a small number of requests can already cause
> issues.
> > >>>>>
> > >>>>> On 03/02/2022 16:14, Gabor Somogyi wrote:
> > >>>>>>> I would prefer not choosing the first option
> > >>>>>> Then the second option may play only.
> > >>>>>>
> > >>>>>>> I am not a Kerberos expert but is it really so that every
> > >> application
> > >>>>> that
> > >>>>>> wants to use Kerberos needs to implement the token propagation
> > >> itself?
> > >>>>> This
> > >>>>>> somehow feels as if there is something missing.
> > >>>>>>
> > >>>>>> OK, so first a short Kerberos + token intro.
> > >>>>>>
> > >>>>>> Some basics:
> > >>>>>> * A TGT can be created from a keytab
> > >>>>>> * A TGT is needed to obtain a TGS (called a token)
> > >>>>>> * Authentication only works with a TGS -> every place that talks to
> > >>>>>> an external system needs either a TGT or a TGS
> > >>>>>>
> > >>>>>> There are basically 2 ways to authenticate to a Kerberos-secured
> > >>>>>> external system:
> > >>>>>> 1. A Kerberos TGT is propagated to all JVMs. Here each and every
> > >>>>>> JVM obtains a TGS by itself, which bombs the KDC, and the KDC may
> > >>>>>> collapse.
> > >>>>>> 2. A Kerberos TGT exists in a single place only (in this case the
> > >>>>>> JM). The JM gets a TGS, which MUST be propagated to all TMs,
> > >>>>>> because otherwise authentication fails.
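The KDC-load difference between the two options above can be sketched with a toy back-of-the-envelope model. This is not Kerberos code, and the numbers (1000 TMs, 4 services) are illustrative assumptions only:

```java
// Toy model of KDC request counts for the two authentication options.
public class KdcLoadSketch {

    // Option 1: every JVM holds the TGT and requests its own TGS per
    // external service, so KDC requests grow with cluster size.
    static int perJvmRequests(int jvms, int services) {
        return jvms * services;
    }

    // Option 2: a single renewer (the JM) obtains one TGS per service and
    // propagates it, so KDC load is independent of cluster size.
    static int singleRenewerRequests(int services) {
        return services;
    }

    public static void main(String[] args) {
        int taskManagers = 1000;
        int services = 4; // e.g. HDFS, HBase, Hive, Kafka
        System.out.println(perJvmRequests(taskManagers, services));  // 4000
        System.out.println(singleRenewerRequests(services));         // 4
    }
}
```

The model ignores renewals and retries, but it shows why a single renewer keeps KDC traffic flat as the cluster grows.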
> > >>>>>>
> > >>>>>> Now the whole system works in a way that the keytab file (we can
> > >>>>>> imagine it as a plaintext password) is reachable on all nodes.
> > >>>>>> This is a relatively huge attack surface. Now the main intention is:
> > >>>>>> * Instead of propagating the keytab file to all nodes, propagate a
> > >>>>>> TGS, which has a limited lifetime (more secure)
> > >>>>>> * Do the TGS generation in a single place so the KDC may not
> > >>>>>> collapse + a keytab that exists only on a single node can be better
> > >>>>>> protected
> > >>>>>>
> > >>>>>> As a final conclusion, any place that is expected to do Kerberos
> > >>>>>> authentication MUST have either a TGT or a TGS.
> > >>>>>> Right now this is done in a pretty insecure way. The questions are
> > >>>>>> the following:
> > >>>>>> * Do we want to fix this insecure keytab propagation and stop
> > >>>>>> bombing the KDC?
> > >>>>>> * If yes, then how do we propagate the more secure tokens to the
> > >>>>>> TMs?
> > >>>>>>
> > >>>>>> If the answer to the first question is no, then the FLIP can be
> > >>>>>> abandoned and isn't worth further effort.
> > >>>>>> If the answer is yes, then we can talk about the how part.
> > >>>>>>
> > >>>>>> G
> > >>>>>>
> > >>>>>>
> > >>>>>> On Thu, Feb 3, 2022 at 3:42 PM Till Rohrmann<trohrm...@apache.org
> >
> > >>>>> wrote:
> > >>>>>>> I would prefer not choosing the first option
> > >>>>>>>
> > >>>>>>>> Make the TM accept tasks only after registration(not sure if
> it's
> > >>>>>>> possible or makes sense at all)
> > >>>>>>>
> > >>>>>>> because it effectively means that we change how Flink's component
> > >>>>> lifecycle
> > >>>>>>> works for distributing Kerberos tokens. It also effectively means
> > >>> that
> > >>>>> a TM
> > >>>>>>> cannot make progress until connected to a RM.
> > >>>>>>>
> > >>>>>>> I am not a Kerberos expert but is it really so that every
> > >> application
> > >>>>> that
> > >>>>>>> wants to use Kerberos needs to implement the token propagation
> > >>> itself?
> > >>>>> This
> > >>>>>>> somehow feels as if there is something missing.
> > >>>>>>>
> > >>>>>>> Cheers,
> > >>>>>>> Till
> > >>>>>>>
> > >>>>>>> On Thu, Feb 3, 2022 at 3:29 PM Gabor Somogyi <
> > >>>>> gabor.g.somo...@gmail.com>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>>>    Isn't this something the underlying resource management
> system
> > >>>>> could
> > >>>>>>> do
> > >>>>>>>> or which every process could do on its own?
> > >>>>>>>>
> > >>>>>>>> I was looking for such a feature but found none.
> > >>>>>>>> Maybe we can solve the propagation more easily, but then I'm
> > >>>>>>>> waiting for a better suggestion.
> > >>>>>>>> If anybody has a better/simpler idea, then please point to a
> > >>>>>>>> specific feature which works on all resource management systems.
> > >>>>>>>>
> > >>>>>>>>> Here's an example for the TM to run workloads without being
> > >>> connected
> > >>>>>>>> to the RM, without ever having a valid token
> > >>>>>>>>
> > >>>>>>>> All in all, I see the main problem. I'm not sure what the reason
> > >>>>>>>> is for a TM accepting tasks w/o registration, but it clearly
> > >>>>>>>> doesn't help here.
> > >>>>>>>> I basically see 2 possible solutions:
> > >>>>>>>> * Make the TM accept tasks only after registration (not sure if
> > >>>>>>>> it's possible or makes sense at all)
> > >>>>>>>> * We send tokens right after container creation with
> > >>>>>>>> "updateDelegationTokens"
> > >>>>>>>> Not sure which one is more realistic to do since I'm not involved
> > >>>>>>>> in the new feature.
> > >>>>>>>> WDYT?
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Thu, Feb 3, 2022 at 3:09 PM Till Rohrmann <
> > >> trohrm...@apache.org>
> > >>>>>>> wrote:
> > >>>>>>>>> Hi everyone,
> > >>>>>>>>>
> > >>>>>>>>> Sorry for joining this discussion late. I also did not read all
> > >>>>>>> responses
> > >>>>>>>>> in this thread so my question might already be answered: Why
> does
> > >>>>> Flink
> > >>>>>>>>> need to be involved in the propagation of the tokens? Why do we
> > >>> need
> > >>>>>>>>> explicit RPC calls in the Flink domain? Isn't this something
> the
> > >>>>>>> underlying
> > >>>>>>>>> resource management system could do, or which every process
> > >>>>>>>>> could do on its own? I am a bit worried that we are making Flink
> > >>>>>>>>> responsible for something that it is not really designed to do.
> > >>>>>>>>>
> > >>>>>>>>> Cheers,
> > >>>>>>>>> Till
> > >>>>>>>>>
> > >>>>>>>>> On Thu, Feb 3, 2022 at 2:54 PM Chesnay Schepler <
> > >>> ches...@apache.org>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Here's an example for the TM to run workloads without being
> > >>>>> connected
> > >>>>>>> to
> > >>>>>>>>>> the RM, while potentially having a valid token:
> > >>>>>>>>>>
> > >>>>>>>>>>    1. TM registers at RM
> > >>>>>>>>>>    2. JobMaster requests slot from RM -> TM gets notified
> > >>>>>>>>>>    3. JM fails over
> > >>>>>>>>>>    4. TM re-offers the slot to the failed over JobMaster
> > >>>>>>>>>>    5. TM reconnects to RM at some point
> > >>>>>>>>>>
> > >>>>>>>>>> Here's an example for the TM to run workloads without being
> > >>>>> connected
> > >>>>>>> to
> > >>>>>>>>>> the RM, without ever having a valid token:
> > >>>>>>>>>>
> > >>>>>>>>>>    1. TM1 has a valid token and is running some tasks.
> > >>>>>>>>>>    2. TM1 crashes
> > >>>>>>>>>>    3. TM2 is started to take over, and re-uses the working
> > >>> directory
> > >>>>> of
> > >>>>>>>>>>       TM1 (new feature in 1.15!)
> > >>>>>>>>>>    4. TM2 recovers the previous slot allocations
> > >>>>>>>>>>    5. TM2 is informed about leading JM
> > >>>>>>>>>>    6. TM2 starts registration with RM
> > >>>>>>>>>>    7. TM2 offers slots to JobMaster
> > >>>>>>>>>>    8. TM2 accepts task submission from JobMaster
> > >>>>>>>>>>    9. ...some time later the registration completes...
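One way to sidestep the races in the two scenarios above, in the spirit of the "proper setup step" idea from earlier in the thread, is to gate task acceptance on the initial token being present, independently of RM registration. A toy sketch; all names here are hypothetical, not Flink API:

```java
import java.util.concurrent.CompletableFuture;

// Toy illustration of treating the initial token as a blocking setup gate.
// The point is only that task submission waits for the token, regardless
// of whether registration with the RM has completed.
public class TokenGate {
    private final CompletableFuture<byte[]> initialToken = new CompletableFuture<>();

    // Called by whatever channel delivers tokens (RM push, sidecar, file).
    void onTokenReceived(byte[] token) {
        initialToken.complete(token);
    }

    // The TM would consult this before accepting task submissions.
    boolean canAcceptTasks() {
        return initialToken.isDone();
    }

    public static void main(String[] args) {
        TokenGate gate = new TokenGate();
        System.out.println(gate.canAcceptTasks()); // false: no token yet
        gate.onTokenReceived("tgs-bytes".getBytes());
        System.out.println(gate.canAcceptTasks()); // true: setup complete
    }
}
```

In scenario 2 above, TM2 would then hold back the slot offers at step 7 until the gate opens, rather than running tasks without ever having a valid token.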
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On 03/02/2022 14:24, Gabor Somogyi wrote:
> > >>>>>>>>>>>> but it can happen that the JobMaster+TM collaborate to run
> > >> stuff
> > >>>>>>>>>>> without the TM being registered at the RM
> > >>>>>>>>>>>
> > >>>>>>>>>>> Honestly I'm not educated enough in Flink to give an example
> > >>>>>>>>>>> of such a scenario.
> > >>>>>>>>>>> Until now I thought JM defines tasks to be done and TM just
> > >>> blindly
> > >>>>>>>>>>> connects to external systems and does the processing.
> > >>>>>>>>>>> All in all if external systems can be touched when JM + TM
> > >>>>>>>>>>> collaboration happens then we need to consider that in the
> > >>> design.
> > >>>>>>>>>>> Since I don't have an example scenario I don't know what
> > >> exactly
> > >>>>>>> needs
> > >>>>>>>>>>> to be solved.
> > >>>>>>>>>>> I think we need an example case to decide whether we face a
> > >> real
> > >>>>>>> issue
> > >>>>>>>>>>> or the design is not leaking.
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, Feb 3, 2022 at 2:12 PM Chesnay Schepler <
> > >>>>> ches...@apache.org>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>       > Just to learn something new. I think local recovery
> is
> > >>>>> clear to
> > >>>>>>>>>>>       me which is not touching external systems like Kafka or
> > so
> > >>>>>>>>>>>       (correct me if I'm wrong). Is it possible that such
> case
> > >> the
> > >>>>> user
> > >>>>>>>>>>>       code just starts to run blindly w/o JM coordination and
> > >>>>> connects
> > >>>>>>>>>>>       to external systems to do data processing?
> > >>>>>>>>>>>
> > >>>>>>>>>>>       Local recovery itself shouldn't touch external systems;
> > >> the
> > >>> TM
> > >>>>>>>>>>>       cannot just run user-code without the JobMaster being
> > >>>>> involved,
> > >>>>>>>>>>>       but it can happen that the JobMaster+TM collaborate to
> > run
> > >>>>> stuff
> > >>>>>>>>>>>       without the TM being registered at the RM.
> > >>>>>>>>>>>
> > >>>>>>>>>>>       On 03/02/2022 13:48, Gabor Somogyi wrote:
> > >>>>>>>>>>>>       > Any error in loading the provider (be it by accident
> > or
> > >>>>>>>>>>>>       explicit checks) then is a setup error and we can fail
> > >> the
> > >>>>>>>>>> cluster.
> > >>>>>>>>>>>>       Fail fast is a good direction in my view. In Spark I
> > >>>>>>>>>>>>       wanted to go in this direction, but there were other
> > >>>>>>>>>>>>       opinions, so there, if a provider is not loaded, the
> > >>>>>>>>>>>>       workload keeps going.
> > >>>>>>>>>>>>       Of course the processing will fail if the token is
> > >>>>>>>>>>>>       missing...
> > >>>>>>>>>>>>       > Requiring HBase (and Hadoop for that matter) to be
> on
> > >> the
> > >>>>> JM
> > >>>>>>>>>>>>       system classpath would be a bit unfortunate. Have you
> > >>>>> considered
> > >>>>>>>>>>>>       loading the providers as plugins?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>       Even if it's unfortunate, the actual implementation
> > >>>>>>>>>>>>       already depends on that. Moving HBase and/or all token
> > >>>>>>>>>>>>       providers into plugins is a possibility.
> > >>>>>>>>>>>>       That way, if one wants to use a specific provider, a
> > >>>>>>>>>>>>       plugin needs to be added. If we want to go in this
> > >>>>>>>>>>>>       direction, I would do that in a separate FLIP so as not
> > >>>>>>>>>>>>       to have feature creep here. The actual FLIP already
> > >>>>>>>>>>>>       covers several thousand lines of code changes.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>       > This is missing from the FLIP. From my experience
> with
> > >>> the
> > >>>>>>>>>>>>       metric reporters, having the implementation rely on
> the
> > >>>>>>>>>>>>       configuration is really annoying for testing purposes.
> > >>> That's
> > >>>>>>> why
> > >>>>>>>>>>>>       I suggested factories; they can take care of
> extracting
> > >> all
> > >>>>>>>>>>>>       parameters that the implementation needs, and then
> pass
> > >>> them
> > >>>>>>>>>>>>       nicely via the constructor.
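The factory suggestion quoted above can be sketched as follows. None of these names are Flink API; the shape is the point: the factory keeps the no-arg constructor that `ServiceLoader` requires, digests the configuration itself, and passes plain values to the provider's constructor, so the provider's fields can stay final and the provider is easy to construct in tests.

```java
import java.util.Map;

// Hypothetical factory-based SPI for token providers.
interface TokenProviderFactory {
    ExampleProvider create(Map<String, String> config);
}

class ExampleProvider {
    private final String principal; // immutable: supplied by the factory

    ExampleProvider(String principal) {
        this.principal = principal;
    }

    String principal() {
        return principal;
    }
}

class ExampleProviderFactory implements TokenProviderFactory {
    public ExampleProvider create(Map<String, String> config) {
        // Parameter extraction/validation lives here, not in the provider.
        return new ExampleProvider(
                config.getOrDefault("security.kerberos.login.principal", "unknown"));
    }
}

public class FactoryDemo {
    public static void main(String[] args) {
        ExampleProvider p = new ExampleProviderFactory()
                .create(Map.of("security.kerberos.login.principal", "flink/host@REALM"));
        System.out.println(p.principal());
    }
}
```

A test can then construct `ExampleProvider` directly with plain values, without touching configuration parsing at all.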
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>       ServiceLoader-provided services must have a no-arg
> > >>>>>>>>>>>>       constructor, so no parameters can be passed.
> > >>>>>>>>>>>>       As a side note, testing delegation token providers is a
> > >>>>>>>>>>>>       real pain and not possible with automated tests without
> > >>>>>>>>>>>>       creating a fully featured Kerberos cluster with KDC,
> > >>>>>>>>>>>>       HDFS, HBase, Kafka, etc..
> > >>>>>>>>>>>>       We've had several tries in Spark but then gave it up
> > >>>>>>>>>>>>       because of its complexity and flakiness, so I wouldn't
> > >>>>>>>>>>>>       care much about unit testing.
> > >>>>>>>>>>>>       The sad truth is that most of the token providers can
> > >>>>>>>>>>>>       only be tested manually on a cluster.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>       Of course this doesn't mean that the whole code is not
> > >>>>>>>>>>>>       intended to be covered with tests. I mean, a couple of
> > >>>>>>>>>>>>       parts can be tested automatically, but the providers are
> > >>>>>>>>>>>>       not among them.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>       > This also implies that any fields of the provider
> > >>> wouldn't
> > >>>>>>>>>>>>       inherently have to be mutable.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>       I think this is not an issue. A provider connects to a
> > >>>>>>>>>>>>       service, obtains token(s) and then closes the
> > >>>>>>>>>>>>       connection; I've never seen the need for an intermediate
> > >>>>>>>>>>>>       state.
> > >>>>>>>>>>>>       I've just mentioned the singleton behavior to be clear.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>       > One example is a TM restart + local recovery, where
> > >>>>>>>>>>>>       > the TM eagerly offers the previous set of slots to
> > >>>>>>>>>>>>       > the leading JM.
> > >>>>>>>>>>>>       Just to learn something new: local recovery is clear to
> > >>>>>>>>>>>>       me; it does not touch external systems like Kafka or so
> > >>>>>>>>>>>>       (correct me if I'm wrong).
> > >>>>>>>>>>>>       Is it possible that in such a case the user code just
> > >>>>>>>>>>>>       starts to run blindly w/o JM coordination and connects
> > >>>>>>>>>>>>       to external systems to do data processing?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>       On Thu, Feb 3, 2022 at 1:09 PM Chesnay Schepler
> > >>>>>>>>>>>>       <ches...@apache.org>  wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>           1)
> > >>>>>>>>>>>>           The manager certainly shouldn't check for specific
> > >>>>>>>>>>>>           implementations.
> > >>>>>>>>>>>>           The problem with classpath-based checks is it can
> > >>> easily
> > >>>>>>>>>>>>           happen that the provider can't be loaded in the
> > first
> > >>>>> place
> > >>>>>>>>>>>>           (e.g., if you don't use reflection, which you
> > >> currently
> > >>>>>>> kinda
> > >>>>>>>>>>>>           force), and in that case Flink can't tell whether
> > the
> > >>>>> token
> > >>>>>>>>>>>>           is not required or the cluster isn't set up
> > >> correctly.
> > >>>>>>>>>>>>           As I see it we shouldn't try to be clever; if the
> > >>>>>>>>>>>>           user wants Kerberos, then have them enable the
> > >>>>>>>>>>>>           providers. Any error in loading the provider (be it
> > >>>>>>>>>>>>           by accident or via explicit checks) is then a setup
> > >>>>>>>>>>>>           error and we can fail the cluster.
> > >>>>>>>>>>>>           If we still want to auto-detect whether the
> provider
> > >>>>> should
> > >>>>>>>>>>>>           be used, note that using factories would make this
> > >>>>> easier;
> > >>>>>>>>>>>>           the factory can check the classpath (not having
> any
> > >>>>> direct
> > >>>>>>>>>>>>           dependencies on HBase avoids the case above), and
> > the
> > >>>>>>>>>>>>           provider no longer needs reflection because it
> will
> > >>> only
> > >>>>> be
> > >>>>>>>>>>>>           used iff HBase is on the CP.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>           Requiring HBase (and Hadoop for that matter) to be
> > on
> > >>>>> the JM
> > >>>>>>>>>>>>           system classpath would be a bit unfortunate. Have
> > you
> > >>>>>>>>>>>>           considered loading the providers as plugins?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>           2) > DelegationTokenProvider#init method
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>           This is missing from the FLIP. From my experience
> > >> with
> > >>>>> the
> > >>>>>>>>>>>>           metric reporters, having the implementation rely
> on
> > >> the
> > >>>>>>>>>>>>           configuration is really annoying for testing
> > >> purposes.
> > >>>>>>> That's
> > >>>>>>>>>>>>           why I suggested factories; they can take care of
> > >>>>> extracting
> > >>>>>>>>>>>>           all parameters that the implementation needs, and
> > >> then
> > >>>>> pass
> > >>>>>>>>>>>>           them nicely via the constructor. This also implies
> > >> that
> > >>>>> any
> > >>>>>>>>>>>>           fields of the provider wouldn't inherently have to
> > be
> > >>>>>>> mutable.
> > >>>>>>>>>>>>           > workloads are not yet running until the initial
> > >> token
> > >>>>> set
> > >>>>>>>>>>>>           is not propagated.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>           This isn't necessarily true. It can happen that
> > tasks
> > >>> are
> > >>>>>>>>>>>>           being deployed to the TM without it having
> > registered
> > >>>>> with
> > >>>>>>>>>>>>           the RM; there is currently no requirement that a
> TM
> > >>> must
> > >>>>> be
> > >>>>>>>>>>>>           registered before it may offer slots / accept task
> > >>>>>>>>>> submissions.
> > >>>>>>>>>>>>           One example is a TM restart + local recovery,
> > >>>>>>>>>>>>           where the TM eagerly offers the previous set of
> > >>>>>>>>>>>>           slots to the leading JM.
> > >>>>>>>>>>>>           On 03/02/2022 12:39, Gabor Somogyi wrote:
> > >>>>>>>>>>>>>           Thanks for the quick response!
> > >>>>>>>>>>>>>           Appreciate your invested time...
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>           G
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>           On Thu, Feb 3, 2022 at 11:12 AM Chesnay Schepler
> > >>>>>>>>>>>>>           <ches...@apache.org>  wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>               Thanks for answering the questions!
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>               1) Does the HBase provider require HBase to
> be
> > >> on
> > >>>>> the
> > >>>>>>>>>>>>>               classpath?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>           To be instantiated no, to obtain a token yes.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>                   If so, then could it even be loaded if
> > >>>>>>>>>>>>>               HBase is not on the classpath?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>           The provider can be loaded but inside the
> provider
> > >> it
> > >>>>> would
> > >>>>>>>>>>>>>           detect whether HBase is on classpath.
> > >>>>>>>>>>>>>           Just to be crystal clear here this is the actual
> > >>>>>>>>>>>>>           implementation what I would like to take over
> into
> > >> the
> > >>>>>>>>>> Provider.
> > >>>>>>>>>>>>>           Please see:
> > >>>>>>>>>>>>>
> > >>
> >
> https://github.com/apache/flink/blob/e6210d40491ff28c779b8604e425f01983f8a3d7/flink-yarn/src/main/java/org/apache/flink/yarn/Utils.java#L243-L254
> > >>>>>>>>>>>>>           I've considered loading only the necessary
> > >>>>>>>>>>>>>           Providers, but that would mean a generic Manager
> > >>>>>>>>>>>>>           needs to know that if the newly loaded Provider is
> > >>>>>>>>>>>>>           an instanceof HBaseDelegationTokenProvider, then
> > >>>>>>>>>>>>>           it needs to be skipped.
> > >>>>>>>>>>>>>           I think it would add unnecessary complexity to the
> > >>>>>>>>>>>>>           Manager and it would contain ugly code parts (at
> > >>>>>>>>>>>>>           least ugly in my view), like this:
> > >>>>>>>>>>>>>           if (provider instanceof
> > HBaseDelegationTokenProvider
> > >>> &&
> > >>>>>>>>>>>>>           hbaseIsNotOnClasspath()) {
> > >>>>>>>>>>>>>             // Skip intentionally
> > >>>>>>>>>>>>>           } else if (provider instanceof
> > >>>>>>>>>>>>>           SomethingElseDelegationTokenProvider &&
> > >>>>>>>>>>>>>           somethingElseIsNotOnClasspath()) {
> > >>>>>>>>>>>>>             // Skip intentionally
> > >>>>>>>>>>>>>           } else {
> > >>>>>>>>>>>>>             providers.put(provider.serviceName(),
> provider);
> > >>>>>>>>>>>>>           }
> > >>>>>>>>>>>>>           I think the approach with the least code and most
> > >>>>>>>>>>>>>           clarity is to load the providers and let each
> > >>>>>>>>>>>>>           decide inside whether everything needed to obtain
> > >>>>>>>>>>>>>           a token is available.
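The "decide inside the provider" approach described above boils down to a reflective classpath probe. A sketch; the HBase class name below is the usual entry point, but the surrounding structure is illustrative, not actual Flink code:

```java
// Probe for a dependency reflectively instead of having the Manager
// special-case provider types with instanceof checks.
public class ClasspathProbe {
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A provider would return "no token" instead of failing when its
        // backing system is absent from the classpath.
        System.out.println(isOnClasspath("org.apache.hadoop.hbase.HBaseConfiguration"));
        System.out.println(isOnClasspath("java.util.List")); // true
    }
}
```

With this, the Manager loads every provider uniformly, and each provider degrades gracefully when its backing system isn't present.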
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>                   If not, then you're assuming the
> classpath
> > >> of
> > >>>>> the
> > >>>>>>>>>>>>>               JM/TM to be the same, which isn't necessarily
> > >> true
> > >>>>> (in
> > >>>>>>>>>>>>>               general; and also if Hbase is loaded from the
> > >>>>>>> user-jar).
> > >>>>>>>>>>>>>           I'm not assuming that the classpath of JM/TM must
> > >>>>>>>>>>>>>           be the same. If the HBase jar is coming from the
> > >>>>>>>>>>>>>           user-jar, then the HBase code is going to use UGI
> > >>>>>>>>>>>>>           within the JVM when authentication is required.
> > >>>>>>>>>>>>>           Of course I've not yet tested this within Flink,
> > >>>>>>>>>>>>>           but in Spark it works fine.
> > >>>>>>>>>>>>>           All in all, the JM/TM classpaths may be different,
> > >>>>>>>>>>>>>           but the HBase jar must exist somehow on both sides.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>               2) None of the /Providers/ in your PoC get
> > >> access
> > >>> to
> > >>>>>>> the
> > >>>>>>>>>>>>>               configuration. Only the /Manager/ is. Note
> that
> > >> I
> > >>> do
> > >>>>>>> not
> > >>>>>>>>>>>>>               know whether there is a need for the
> providers
> > >> to
> > >>>>> have
> > >>>>>>>>>>>>>               access to the config, as that's very
> > >>> implementation
> > >>>>>>>>>>>>>               specific I suppose.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>           You're right. Since this is just a POC and I don't
> > >>>>>>>>>>>>>           have a green light, I've not put too much effort into
> > >>>>>>>>>>>>>           a proper self-review. The DelegationTokenProvider#init
> > >>>>>>>>>>>>>           method must get the Flink configuration.
> > >>>>>>>>>>>>>           The reason is that several further configurations can
> > >>>>>>>>>>>>>           be derived from it; a good example is obtaining the
> > >>>>>>>>>>>>>           Hadoop conf. The rationale is the same as before: it
> > >>>>>>>>>>>>>           would be good to keep the Manager as generic as
> > >>>>>>>>>>>>>           possible.
> > >>>>>>>>>>>>>           To be more specific, some code must load the Hadoop
> > >>>>>>>>>>>>>           conf, and that could be either the Manager or the
> > >>>>>>>>>>>>>           Provider. If the Manager does it, then the generic
> > >>>>>>>>>>>>>           Manager must be modified every time something special
> > >>>>>>>>>>>>>           is needed for a new provider.
> > >>>>>>>>>>>>>           This could be super problematic when a custom provider
> > >>>>>>>>>>>>>           is written.
> > >>>>>>>>>>>>>
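The provider contract implied above (init receives the Flink configuration so each provider derives its own service-specific settings) could look roughly like this. All names are assumptions taken from the mail, not the final Flink API, and a plain `Map` stands in for Flink's `Configuration`.

```java
import java.util.Map;

// Rough sketch of the discussed provider contract: the Manager stays generic
// and passes the whole configuration down; each provider decides on its own
// whether it can and should obtain tokens. Names are illustrative.
public class ProviderContract {

    public interface DelegationTokenProvider {
        String serviceName();
        void init(Map<String, String> flinkConf);
        boolean delegationTokensRequired();
    }

    /** Toy provider that only activates when its service is configured. */
    public static class DemoProvider implements DelegationTokenProvider {
        private Map<String, String> conf;

        public String serviceName() { return "demo"; }

        public void init(Map<String, String> flinkConf) { this.conf = flinkConf; }

        public boolean delegationTokensRequired() {
            // Service-specific knowledge stays inside the provider,
            // so the generic Manager never needs per-service changes.
            return conf != null && conf.containsKey("demo.address");
        }
    }

    public static void main(String[] args) {
        DemoProvider p = new DemoProvider();
        p.init(Map.of("demo.address", "host:1234"));
        System.out.println(p.delegationTokensRequired());
    }
}
```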
> > >>>>>>>>>>>>>               10) I'm not sure myself. It could be something
> > >>>>>>>>>>>>>               as trivial as creating some temporary directory
> > >>>>>>>>>>>>>               in HDFS, I suppose.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>           I've not found such a task. YARN and K8S do not
> > >>>>>>>>>>>>>           expect such things from executors, and workloads do
> > >>>>>>>>>>>>>           not start running until the initial token set has
> > >>>>>>>>>>>>>           been propagated.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>               On 03/02/2022 10:23, Gabor Somogyi wrote:
> > >>>>>>>>>>>>>>               Please see my answers inline. Hope provided
> > >>>>> satisfying
> > >>>>>>>>>> answers to all
> > >>>>>>>>>>>>>>               questions.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>               G
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>               On Thu, Feb 3, 2022 at 9:17 AM Chesnay
> > >>>>>>>>>>>>>>               Schepler <ches...@apache.org> wrote:
> > >>>>>>>>>>>>>>>               I have a few questions that I'd appreciate it
> > >>>>>>>>>>>>>>>               if you could answer.
> > >>>>>>>>>>>>>>>                   1. How does the Provider know whether it
> > >>>>>>>>>>>>>>>                   is required or not?
> > >>>>>>>>>>>>>>               All providers which are registered properly are
> > >>>>>>>>>>>>>>               going to be loaded and asked to obtain tokens.
> > >>>>>>>>>>>>>>               Worth mentioning that every provider has the
> > >>>>>>>>>>>>>>               right to decide whether it wants to obtain
> > >>>>>>>>>>>>>>               tokens or not (bool delegationTokensRequired()).
> > >>>>>>>>>>>>>>               For instance, if a provider detects that HBase
> > >>>>>>>>>>>>>>               is not on the classpath or not configured
> > >>>>>>>>>>>>>>               properly, then no tokens are obtained from that
> > >>>>>>>>>>>>>>               specific provider.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>               You may ask how a provider is registered. Here
> > >>>>>>>>>>>>>>               it is: the provider is on the classpath, plus
> > >>>>>>>>>>>>>>               there is a META-INF file which contains the name
> > >>>>>>>>>>>>>>               of the provider, for example:
> > >>>>>>>>>>>>>>               META-INF/services/org.apache.flink.runtime.security.token.DelegationTokenProvider
> > >>>>>>>>>>>>>>               https://github.com/apache/flink/compare/master...gaborgsomogyi:dt?expand=1#diff-b65ee7e64c5d2dfbb683d3569fc3e42f4b5a8052ab83d7ac21de5ab72f428e0b
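The META-INF/services registration described above is the standard Java `ServiceLoader` mechanism. A minimal sketch, with `TokenProvider` as a stand-in name rather than Flink's real interface:

```java
import java.util.ServiceLoader;

// Illustrative sketch of META-INF/services-based discovery. Every
// implementation listed in a classpath resource named
// META-INF/services/<fully.qualified.TokenProvider> gets instantiated.
public class ProviderDiscovery {

    public interface TokenProvider {
        String serviceName();
    }

    public static int countRegistered() {
        int n = 0;
        for (TokenProvider p : ServiceLoader.load(TokenProvider.class)) {
            n++; // each discovered provider would be registered by name here
        }
        return n;
    }

    public static void main(String[] args) {
        // No META-INF registration ships with this sketch, so none are found.
        System.out.println(countRegistered());
    }
}
```

Dropping a jar with such a resource file onto the classpath is all a cluster admin would need to do to add a custom provider.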
> > >>>>>>>>>>>>>>>                   2. How does the configuration of Providers
> > >>>>>>>>>>>>>>>                   work (how do they get access to a
> > >>>>>>>>>>>>>>>                   configuration)?
> > >>>>>>>>>>>>>>               The Flink configuration is going to be passed to
> > >>>>>>>>>>>>>>               all providers. Please see the POC here:
> > >>>>>>>>>>>>>>               https://github.com/apache/flink/compare/master...gaborgsomogyi:dt?expand=1
> > >>>>>>>>>>>>>>               Service-specific configurations are loaded
> > >>>>>>>>>>>>>>               on-the-fly. For example, in the HBase case it
> > >>>>>>>>>>>>>>               looks for the HBase configuration class, which
> > >>>>>>>>>>>>>>               will be instantiated within the provider.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>                   3. How does a user select providers? (Is
> > >>>>>>>>>>>>>>>                   it purely based on the provider being on
> > >>>>>>>>>>>>>>>                   the classpath?)
> > >>>>>>>>>>>>>>               Providers can be explicitly turned off with the
> > >>>>>>>>>>>>>>               following config:
> > >>>>>>>>>>>>>>               "security.kerberos.tokens.${name}.enabled".
> > >>>>>>>>>>>>>>               I've never seen 2 different implementations
> > >>>>>>>>>>>>>>               exist for a specific external service, but if
> > >>>>>>>>>>>>>>               this edge case were to occur then the mentioned
> > >>>>>>>>>>>>>>               config needs to be added, and a new provider
> > >>>>>>>>>>>>>>               with a different name needs to be implemented
> > >>>>>>>>>>>>>>               and registered.
> > >>>>>>>>>>>>>>               All in all, we've seen that provider handling is
> > >>>>>>>>>>>>>>               not a user-specific task but a cluster-admin
> > >>>>>>>>>>>>>>               one. If a specific provider is needed, then it's
> > >>>>>>>>>>>>>>               implemented once per company, registered once to
> > >>>>>>>>>>>>>>               the clusters, and then all users may or may not
> > >>>>>>>>>>>>>>               use the obtained tokens.
> > >>>>>>>>>>>>>>               Worth mentioning that the system will know which
> > >>>>>>>>>>>>>>               token needs to be used when HDFS is accessed;
> > >>>>>>>>>>>>>>               this part is automatic.
> > >>>>>>>>>>>>>>
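The "security.kerberos.tokens.${name}.enabled" switch described above can be sketched as a simple on-by-default filter. The config key follows the mail; the surrounding code and names are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the discussed enabled-flag check: every registered provider is
// on by default, and an explicit "false" in the configuration turns one off.
public class ProviderFilter {

    static boolean isEnabled(Map<String, String> conf, String providerName) {
        String key = "security.kerberos.tokens." + providerName + ".enabled";
        return Boolean.parseBoolean(conf.getOrDefault(key, "true"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("security.kerberos.tokens.hbase.enabled", "false");
        System.out.println(isEnabled(conf, "hadoopfs")); // on by default
        System.out.println(isEnabled(conf, "hbase"));    // explicitly disabled
    }
}
```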
> > >>>>>>>>>>>>>>>                   4. How can a user override an existing
> > >>>>>>>>>>>>>>>                   provider?
> > >>>>>>>>>>>>>>>               Please see the previous bullet point.
> > >>>>>>>>>>>>>>>                   5. What is DelegationTokenProvider#name()
> > >>>>>>>>>>>>>>>                   used for?
> > >>>>>>>>>>>>>>               All providers which are registered properly (on
> > >>>>>>>>>>>>>>               the classpath + META-INF entry) are on by
> > >>>>>>>>>>>>>>               default. With
> > >>>>>>>>>>>>>>               "security.kerberos.tokens.${name}.enabled" a
> > >>>>>>>>>>>>>>               specific provider can be turned off.
> > >>>>>>>>>>>>>>               Additionally, I intend to use this in log
> > >>>>>>>>>>>>>>               entries later on for debugging purposes. For
> > >>>>>>>>>>>>>>               example: "hadoopfs provider obtained 2 tokens
> > >>>>>>>>>>>>>>               with ID...". This would help track what is
> > >>>>>>>>>>>>>>               happening with tokens and when. The same applies
> > >>>>>>>>>>>>>>               to the TaskManager side: "2 hadoopfs provider
> > >>>>>>>>>>>>>>               tokens arrived with ID...". Important to note
> > >>>>>>>>>>>>>>               that the secret part will be hidden in the
> > >>>>>>>>>>>>>>               mentioned log entries to keep the attack surface
> > >>>>>>>>>>>>>>               low.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>                   6. What happens if the names of 2
> > >>>>>>>>>>>>>>>                   providers are identical?
> > >>>>>>>>>>>>>>               I presume you mean 2 different classes which are
> > >>>>>>>>>>>>>>               both registered and have the same logic inside.
> > >>>>>>>>>>>>>>               In this case both will be loaded and both are
> > >>>>>>>>>>>>>>               going to obtain token(s) for the same service.
> > >>>>>>>>>>>>>>               Both obtained token(s) are going to be added to
> > >>>>>>>>>>>>>>               the UGI. As a result the second will overwrite
> > >>>>>>>>>>>>>>               the first, but the order is not defined. Since
> > >>>>>>>>>>>>>>               both token(s) are valid, no matter which one is
> > >>>>>>>>>>>>>>               used, access to the external system will work.
> > >>>>>>>>>>>>>>               When the class names are the same, then the
> > >>>>>>>>>>>>>>               service loader only loads a single entry because
> > >>>>>>>>>>>>>>               services are singletons. That's the reason why
> > >>>>>>>>>>>>>>               keeping state inside providers is not advised.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>                   7. Will we directly load the provider, or
> > >>>>>>>>>>>>>>>                   first load a factory (usually preferable)?
> > >>>>>>>>>>>>>>               I intend to load a provider directly from the
> > >>>>>>>>>>>>>>               DTM. We could add an extra layer to have a
> > >>>>>>>>>>>>>>               factory, but after consideration I came to the
> > >>>>>>>>>>>>>>               conclusion that it would be overkill in this
> > >>>>>>>>>>>>>>               case.
> > >>>>>>>>>>>>>>               Please have a look at how it's planned to load
> > >>>>>>>>>>>>>>               providers now:
> > >>>>>>>>>>>>>>               https://github.com/apache/flink/compare/master...gaborgsomogyi:dt?expand=1#diff-d56a0bc77335ff23c0318f6dec1872e7b19b1a9ef6d10fff8fbaab9aecac94faR54-R81
> > >>>>>>>>>>>>>>>                   8. What is the Credentials class (it would
> > >>>>>>>>>>>>>>>                   necessarily have to be a public api as
> > >>>>>>>>>>>>>>>                   well)?
> > >>>>>>>>>>>>>>               The Credentials class is coming from Hadoop. My
> > >>>>>>>>>>>>>>               main intention was not to bind the
> > >>>>>>>>>>>>>>               implementation to Hadoop completely. It is not
> > >>>>>>>>>>>>>>               possible because of the following reasons:
> > >>>>>>>>>>>>>>               * Several functionalities are a must because
> > >>>>>>>>>>>>>>               there are no alternatives, including but not
> > >>>>>>>>>>>>>>               limited to login from keytab, proper TGT cache
> > >>>>>>>>>>>>>>               handling, passing tokens to Hadoop services like
> > >>>>>>>>>>>>>>               HDFS, HBase, Hive, etc.
> > >>>>>>>>>>>>>>               * The partial win is that the whole delegation
> > >>>>>>>>>>>>>>               token framework is only going to be initiated if
> > >>>>>>>>>>>>>>               hadoop-common is on the classpath (Hadoop is
> > >>>>>>>>>>>>>>               optional in the core libraries).
> > >>>>>>>>>>>>>>               The possibilities to eliminate Credentials from
> > >>>>>>>>>>>>>>               the API could be:
> > >>>>>>>>>>>>>>               * to convert Credentials to a byte array and
> > >>>>>>>>>>>>>>               back while a provider gives back token(s): I
> > >>>>>>>>>>>>>>               think this would be overkill and would make the
> > >>>>>>>>>>>>>>               API less clear about what to give back that the
> > >>>>>>>>>>>>>>               Manager understands
> > >>>>>>>>>>>>>>               * to re-implement the Credentials internal
> > >>>>>>>>>>>>>>               structure in a POJO; here the same conversion
> > >>>>>>>>>>>>>>               back and forth would happen between provider and
> > >>>>>>>>>>>>>>               manager. I think this case would be a re-invent
> > >>>>>>>>>>>>>>               the wheel scenario.
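To make the trade-off above concrete, the "convert to a byte array and back" option amounts to a small codec like the following. This is a hypothetical sketch using a plain name-to-token map instead of Hadoop's Credentials class; all names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical round-trip codec for the "byte array forth and back" option.
// A real implementation would wrap Hadoop's own Credentials read/write.
public class TokenCodec {

    static byte[] serialize(Map<String, byte[]> tokens) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(tokens.size());
        for (Map.Entry<String, byte[]> e : tokens.entrySet()) {
            out.writeUTF(e.getKey());        // token alias
            out.writeInt(e.getValue().length);
            out.write(e.getValue());         // opaque token bytes
        }
        return bos.toByteArray();
    }

    static Map<String, byte[]> deserialize(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        Map<String, byte[]> tokens = new LinkedHashMap<>();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            String name = in.readUTF();
            byte[] token = new byte[in.readInt()];
            in.readFully(token);
            tokens.put(name, token);
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> tokens = new LinkedHashMap<>();
        tokens.put("hdfs", new byte[]{1, 2, 3});
        System.out.println(deserialize(serialize(tokens)).size());
    }
}
```

The extra conversion layer is exactly the overhead the mail argues against, since Hadoop's Credentials already carries this structure.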
> > >>>>>>>>>>>>>>>                   9. What does the TaskManager do with the
> > >>>>>>>>>>>>>>>                   received token?
> > >>>>>>>>>>>>>>               It puts the tokens into the UserGroupInformation
> > >>>>>>>>>>>>>>               instance for the current user. That way
> > >>>>>>>>>>>>>>               Hadoop-compatible services can pick up the
> > >>>>>>>>>>>>>>               tokens from there properly.
> > >>>>>>>>>>>>>>               This is an existing pattern inside Spark.
> > >>>>>>>>>>>>>>>                   10. Is there any functionality in the
> > >>>>>>>>>>>>>>>                   TaskManager that could require a token on
> > >>>>>>>>>>>>>>>                   startup (i.e., before registering with the
> > >>>>>>>>>>>>>>>                   RM)?
> > >>>>>>>>>>>>>>               I've never seen such functionality in Spark, and
> > >>>>>>>>>>>>>>               after analysis I've not seen it in Flink either.
> > >>>>>>>>>>>>>>               If you have something in mind which I've missed,
> > >>>>>>>>>>>>>>               please help me out.
> > >>>>>>>>>>>>>>               On 11/01/2022 14:58, Gabor Somogyi wrote:
> > >>>>>>>>>>>>>>>               Hi All,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>               Hope all of you have enjoyed the holiday season.
> > >>>>>>>>>>>>>>>               I would like to start the discussion on FLIP-211
> > >>>>>>>>>>>>>>>               https://cwiki.apache.org/confluence/display/FLINK/FLIP-211%3A+Kerberos+delegation+token+framework
> > >>>>>>>>>>>>>>>               which aims to provide a Kerberos delegation
> > >>>>>>>>>>>>>>>               token framework that obtains/renews/distributes
> > >>>>>>>>>>>>>>>               tokens out-of-the-box.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>               Please be aware that the FLIP wiki area is not
> > >>>>>>>>>>>>>>>               fully done since the discussion may change the
> > >>>>>>>>>>>>>>>               feature in major ways. The proposal can be
> > >>>>>>>>>>>>>>>               found in a google doc here:
> > >>>>>>>>>>>>>>>               https://docs.google.com/document/d/1JzMbQ1pCJsLVz8yHrCxroYMRP2GwGwvacLrGyaIx5Yc/edit?fbclid=IwAR0vfeJvAbEUSzHQAAJfnWTaX46L6o7LyXhMfBUCcPrNi-uXNgoOaI8PMDQ
> > >>>>>>>>>>>>>>>               As the community agrees on the approach, the
> > >>>>>>>>>>>>>>>               content will be moved to the wiki page.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>               Feel free to add your thoughts to make this
> > >>>>>>>>>>>>>>>               feature better!
> > >>>>>>>>>>>>>>>               BR,
> > >>>>>>>>>>>>>>>               G