Re: [DISCUSS] KIP 368: Allow SASL Connections to Periodically Re-Authenticate

Rajini Sivaram Wed, 12 Sep 2018 10:36:46 -0700

Hi Ron,

Thank you for summarising, I think it covers the differences between the
two approaches well.


A few minor points to answer the questions in there:

1) When re-authetication is initiated in the Selector during poll(), we can
move an idle channel to re-authentication state. It is similar to injecting
requests, but achieved by changing channel back to authenticating state.

3) To clarify why I think re-authentication should fit in with our
authentication design: My point was not about a specific connection being
usable or not usable. It was about what happens at the client API level.
Our client API (producer/consumer/admin client etc.) currently assume that
a single broker authentication failure is a fatal error that is never
retried because we assume that broker only ever fails an authentication
request if credentials are invalid. If we ever decide to support cases
where broker occasionally fails an authentication request due to a
transient failure, we need to do more around how we handle authentication
failures in clients. We may decide that it is ok to close the connection
for authentication and not for re-authentication as you mentioned, but we
need to change the way this disconnection is handled by clients. So IMO, we
should either add support for transient retriable authentication failures
properly or not retry for any scenario. Personally, I don't think we would
want to retry all authentication failures even if it is a
re-authentication, I think we could (at some point in future), allow
brokers to return an error code that indicates that it is a transient
broker-side failure rather than invalid credentials and handle the error
differently. I see no reason at that point why we wouldn't handle
authentication and re-authentication in the same way.

4) As you said, the high-level approach would be bigger than the low-level
approach in terms of LOC. But I wouldn't be too concerned about lines of
code. My bigger concern was about modularity. Our security code is already
complex, protocols like Kerberos and SSL that we use from the JRE make
problem diagnosis hard. Async I/O makes the networking code complex. You
need to understand networking layer to work with the security layer, but
the rest of the code base doesn't rely on knowledge of network/security
layers. My main concern about the high-level approach is that it spans
these boundaries, making it harder to maintain in the long run.


On Wed, Sep 12, 2018 at 10:23 AM, Stanislav Kozlovski <
[email protected]> wrote:

> Hi Ron, Rajini
>
> Thanks for summarizing the discussion so far, Ron!
>
> 1) How often do we have such long-lived connection idleness (e.g 5-10
> minutes) in practice?
>
> 3) I agree that retries for re-authentication are useful.
>
> 4) The interleaving of requests sounds like a great feature to have, but
> the tradeoff against code complexity is a tough one. I would personally go
> with the simpler approach since you could always add interleaving on top if
> the community decides the latency should be better.
>
> Best,
> Stanislav
>
> On Tue, Sep 11, 2018 at 5:00 AM Ron Dagostino <[email protected]> wrote:
>
> > Hi everyone.  I've updated the PR to reflect the latest conclusions from
> > this ongoing discussion.  The KIP still needs the suggested updates; I
> will
> > do that later this week.  II agree with Rajini that some additional
> > feedback from the community at large would be very helpful at this point
> in
> > time.
> >
> > Here's where we stand.
> >
> > We have 2 separate implementations for re-authenticating existing
> > connections -- a so-called "low-level" approach and a (relatively
> speaking)
> > "high-level" approach -- and we agree on how they should be compared.
> > Specifically, Rajini has provided a mechanism that works at a relatively
> > low level in the stack by intercepting network sends and queueing them up
> > while re-authentication happens; then the queued sends are sent after
> > re-authentication succeeds, or the connection is closed if
> > re-authentication fails.  See the prototype commit at
> >
> > https://github.com/rajinisivaram/kafka/commit/b9d711907ad843
> c11d17e80d6743bfb1d4e3f3fd
> > .
> >
> > I have an implementation that works higher up in the stack; it injects
> > authentication requests into the KafkaClient via a new method added to
> that
> > interface, and the implementation (i.e. NetworkClient) sends them when
> > poll() is called.  The updated PR is available at
> > https://github.com/apache/kafka/pull/5582/.
> >
> > Here is how we think these two approaches should be compared.
> >
> > 1) *When re-authentication starts*.  The low-level approach initiates
> > re-authentication only if/when the connection is actually used, so it may
> > start after the existing credential expires; the current PR
> implementation
> > injects re-authentication requests into the existing flow, and
> > re-authentication starts immediately regardless of whether or not the
> > connection is being used for something else.  Rajini believes the
> low-level
> > approach can be adjusted to re-authenticate idle connections (Rajini, I
> > don't see how this can be done without injecting requests, which is what
> > the high-level connection is doing, but I am probably missing something
> --
> > no need to go into it unless it is easy to explain.)
> >
> > 2) *When requests not related to re-authentication are able to use the
> > re-authenticating connection*.  The low-level approach finishes
> > re-authentication completely before allowing anything else to traverse
> the
> > connection, and we agree that this is how the low-level implementation
> must
> > work without significant work/changes.  The current PR implementation
> > interleaves re-authentication requests with the existing flow in the
> > asynchronous I/O uses cases (Consumer, Producer, Admin Client, etc.), and
> > it allows the two synchronous use cases (ControllerChannelManager and
> > ReplicaFetcherBlockingSend/ReplicaFetcherThread) to decide which style
> they
> > want.  I (somewhat arbitrarily, to prove out the flexibility of the
> > high-level approach) coded ControllerChannelManager to do complete,
> > non-interleaved re-authentication and ReplicaFetcherBlockingSend/
> > ReplicaFetcherThread to interleave requests.  The approach has impacts on
> > the maximum size of latency spikes: interleaving requests can decrease
> the
> > latency.  The benefit of interleaving is tough to evaluate because it
> isn't
> > clear what the latency requirement really is.  Note that
> re-authentication
> > can entail several network round-trips between the client and the broker.
> > Comments in this area would be especially appreciated.
> >
> > 3) *What happens if re-authentication fails (i.e. retry capability)*.  I
> > think this is where we have not yet settled on what the requirement is
> > (even more so than the issue of potential latency mitigation
> > requirements).  Rajini, you had mentioned that re-authentication should
> > work the same way as authentication, but I see the two situations as
> being
> > asymmetric.  In the authentication case, if authentication fails, the
> > connection was never fully established and could not be used, and it is
> > closed -- TLS negotiation may have taken place, so that effort is lost,
> but
> > beyond that nothing else is lost.  In the re-authentication case the
> > connection is fully established, it is in use, and in fact it can remain
> in
> > use for some time going forward as long as the existing credential
> remains
> > unexpired; abandoning it at this point due to a re-authentication failure
> > (which can occur due to no fault of the client -- i.e. a remote directory
> > or OAuth server being temporarily down) abandons a fully functioning
> > connection with remaining lifetime.  Am I thinking about this reasonably?
> > I still tend to believe that retries of failed re-authentications are a
> > requirement.
> >
> > 4) *Size*. The low-level approach is smaller in terms of code.  It is a
> > prototype, so it doesn't have everything the high-level approach has,
> but I
> > have split the PR for the high-level approach into two separate commits
> to
> > facilitate a more apples-to-apples comparison: the first commit contains
> > infrastructure that would have to be added to the low-level prototype if
> we
> > went with that, and the second commit has everything specific to the
> > high-level approach.  So comparing the low-level prototype commit (~350
> > changes) to the second commit in the high-level PR (~1400 changes) is
> > reasonable. The high-level approach at this point is about 4 times as
> big,
> > though it does have lots of error/sanity checking and functionality (such
> > as retry) that would to at least some degree have to be added to the
> > low-level implementation.  Still, the high-level approach is (and will
> > likely remain) somewhat bigger than the low-level approach.
> >
> > I hope I presented this in a balanced and accurate way; Rajini, please
> > correct me if I got anything wrong or left out anything important.
> > Apologies in advance if that is the case -- it is unintended.
> >
> > Again, comments/discussion from the wider community at this juncture
> would
> > be especially appreciated.
> >
> > Ron
> >
> >
> >
> >
> > On Mon, Sep 10, 2018 at 7:42 AM Rajini Sivaram <[email protected]>
> > wrote:
> >
> > > Hi Ron,
> > >
> > > Thank you for the analysis. Yes, I agree with the differences you have
> > > pointed out between the two approaches.
> > >
> > >    1. Re-authenticating idle connections (or connections nearing idle
> > >    timeout): With the lower-level approach, there is nothing stopping
> us
> > > from
> > >    re-authenticating idle connections. We could either consider idle
> > >    connections for re-authentication or we could implement code in
> broker
> > > to
> > >    kill connections that deals with delays introduced by idleness.
> Since
> > we
> > >    expect token lifetimes to be much higher than connection expiry
> time,
> > >    either would work. And the second approach is probably better.
> > >    2. Interleaving authentication requests and application requests to
> > >    reduce impact on latency: It is not impossible to interleave with
> the
> > >    low-level approach, but it adds complexity and reduces the value of
> > this
> > >    approach. So it will be good to understand the requirements in terms
> > of
> > >    latency. In terms of comparing the two approaches, it is fair to
> > compare
> > >    non-interleaved low-level with the naturally interleaved high-level
> > >    approach.
> > >    3. Retries: I don't think requirement for retries is different
> between
> > >    authentication and re-authentication from a client's point of view.
> We
> > >    should either support retries for both or not at all. We currently
> > > process
> > >    authentication failures as fatal errors. We expect that if you are
> > using
> > >    external authentication servers, those servers will be highly
> > available.
> > >    Our built-in authentication mechanisms avoid accessing external
> > servers
> > > in
> > >    the authentication path (during authentication, verification is
> > > performed
> > >    using private/public keys or cache lookup). This is necessary in our
> > > design
> > >    since authentications are performed synchronously on the network
> > thread.
> > >
> > > One more point with the lower-level mechanism: If one day we decided
> that
> > > we want to extend this for SSL, it would be possible to change just the
> > > transport layer and make it work for SSL. We may never want to do this,
> > but
> > > thought I would point out anyway.
> > >
> > > I think the next step would be to add the low-level approach to the KIP
> > > under `Rejected Alternatives` with details on the differences. And more
> > > detail on requirements covering latency and retries in the body of the
> > > KIP.  For "Rejected Alternatives", can we add sub-sections (I think
> there
> > > are two approaches there now, but you need to read through to figure
> that
> > > out, so sub-titles would help).
> > >
> > > Once the KIP is updated, it will be good to get more feedback from the
> > > community to decide on the approach to adopt.
> > >
> > >
> > > On Sat, Sep 8, 2018 at 1:27 PM, Ron Dagostino <[email protected]>
> wrote:
> > >
> > > > Hi yet again, Rajini.  If we decide that interleaving is impossible
> > with
> > > > the low-level approach but we still want to at least support the
> > > > possibility given that it reduces the size of latency spikes, then I
> do
> > > > have a suggestion.  I documented in the KIP how I rejected a single,
> > > > one-size-fits-all queue approach because we don't know when poll()
> will
> > > be
> > > > called in the synchronous I/O use case.  An advantage of such an
> > approach
> > > > -- if it would have worked, which it didn't -- is that it would have
> > been
> > > > transparent to the owner of the NetworkClient instance (this is one
> of
> > > the
> > > > big advantages of the low-level approach, after all).  But we could
> > make
> > > > the one-size-fits-all queue approach work if we are willing to impose
> > > some
> > > > requirements on synchronous I/O users of NetworkClient.
> Specifically,
> > we
> > > > could add a method to the org.apache.kafka.clients.KafkaClient
> > interface
> > > > (which NetworkClient implements) as follows:
> > > >
> > > >     /**
> > > >
> > > >      * Return true if any node has re-authentication requests waiting
> > to
> > > be
> > > > sent. A
> > > >
> > > >      * call to {@link #poll(long, long)} is required to send such
> > > requests.
> > > > An owner
> > > >
> > > >      * of this instance that does not implement a run loop to
> > repeatedly
> > > > call
> > > >
> > > >      * {@link #poll(long, long)} but instead only sends requests
> > > > synchronously
> > > >
> > > >      * on-demand must call this method periodically -- and invoke
> > > >
> > > >      * {@link #poll(long, long)} if the return value is {@code true}
> --
> > > to
> > > > ensure
> > > >
> > > >      * that any re-authentication requests that have ben injected are
> > > sent
> > > > in a
> > > >
> > > >      * timely fashion.
> > > >
> > > >      *
> > > >
> > > >      * @return true if any node has re-authentication requests
> waiting
> > to
> > > > be sent
> > > >
> > > >      */
> > > >
> > > >     default boolean hasReauthenticationRequests() {
> > > >
> > > >         return false;
> > > >
> > > >     }
> > > >
> > > > If we are willing to add an additional method like this and make the
> > > minor
> > > > adjustments to the code associated with the few synchronous use cases
> > > then
> > > > I think the high-level approach will work.  We would then have the
> > > > possibility of further parameterizing the various synchronous I/O use
> > > > cases.  For example, there are currently 3:
> > > >
> > > > ControllerChannelManager
> > > > KafkaProducer, via Sender
> > > > ReplicaFetcherBlockingSend, via ReplicaFetcherThread
> > > >
> > > > Is it conceivable that one of these use cases might be more
> > > > latency-sensitive than others and might desire interleaving?  A
> > > high-level
> > > > approach would give us the option of configuring each use case
> > > accordingly.
> > > >
> > > > Ron
> > > >
> > > > On Fri, Sep 7, 2018 at 10:50 PM Ron Dagostino <[email protected]>
> > wrote:
> > > >
> > > > > Hi again, Rajini.  It occurs to me that from a *behavior*
> perspective
> > > > > there are really 3 fundamental differences between the low-level
> > > approach
> > > > > you provided and the high-level approach as it currently exists in
> > the
> > > > PR:
> > > > >
> > > > > 1) *When re-authentication starts*.  The low-level approach
> initiates
> > > > > re-authentication only if/when the connection is actually used, so
> it
> > > may
> > > > > start after the existing credential expires; the current PR
> > > > implementation
> > > > > injects re-authentication requests into the existing flow, and
> > > > > re-authentication starts immediately regardless of whether or not
> the
> > > > > connection is being used for something else.
> > > > >
> > > > > 2) *When requests not related to re-authentication can use the
> > > > > re-authenticating connection*.  The low-level approach finishes
> > > > > re-authentication completely before allowing anything else to
> travers
> > > the
> > > > > connection; the current PR implementation interleaves
> > re-authentication
> > > > > requests with existing flow requests.
> > > > >
> > > > > 3) *What happens if re-authentication fails*.  The low-level
> approach
> > > > > results in a closed connection and does not support retries -- at
> > least
> > > > as
> > > > > currently implemented; the current PR implementation supports
> > retries.
> > > > >
> > > > > Do you agree that these are the (only?) behavioral differences?
> > > > >
> > > > > For these facets of behavior, I wonder what the requirements are
> for
> > > this
> > > > > feature.  I think they are as follows:
> > > > >
> > > > > 1) *When re-authentication starts*: I don't think we have a hard
> > > > > requirement here when considered in isolation -- whether
> > > > re-authentication
> > > > > starts immediately or is delayed until the connection is used
> > probably
> > > > > doesn't matter.
> > > > >
> > > > > 2) *When requests not related to re-authentication can use the
> > > > > re-authenticating connection*: there is a tradeoff here between
> > latency
> > > > > and ability to debug re-authentication problems.  Delaying use of
> the
> > > > > connection until re-authentication finishes results in bigger
> latency
> > > > > spikes but keeps re-authentication requests somewhat together;
> > > > interleaving
> > > > > minimizes the size of individual latency spikes but adds some
> > > separation
> > > > > between the requests.
> > > > >
> > > > > 3) *What happens if re-authentication fails*: I believe we have a
> > clear
> > > > > requirement for retry capability given what I have written
> > previously.
> > > > Do
> > > > > you agree?  Note that while the current low-level implementation
> does
> > > not
> > > > > support retry, I have been thinking about how that can be done,
> and I
> > > am
> > > > > pretty sure it can be.  We can keep the old authenticators on the
> > > client
> > > > > and server sides and put them back into place if the
> > re-authentication
> > > > > fails; we would also have to make sure the server side doesn't
> delay
> > > any
> > > > > failed re-authentication result and also doesn't close the
> connection
> > > > upon
> > > > > re-authentication failure.  I think I see how to do all of that.
> > > > >
> > > > > There are some interactions between the above requirements.  If
> > > > > re-authentication can't start immediately and has to wait for the
> > > > > connection to be used then that precludes interleaving because we
> > can't
> > > > be
> > > > > sure that the credential will be active by the time it is used --
> if
> > it
> > > > > isn't active, and we allow interleaving, then requests not related
> to
> > > > > re-authentication will fail if server-side
> > > > > connection-close-due-to-expired-credential functionality is in
> place.
> > > > Also
> > > > > note that any such server-side connection-close-due-to-
> > > > expired-credential
> > > > > functionality would likely have to avoid closing a connection until
> > it
> > > is
> > > > > used for anything other than re-authentication -- it must allow
> > > > > re-authentication requests through when the credential is expired.
> > > > >
> > > > > Given all of the above, it feels to me like the low-level solution
> is
> > > > > viable only under the following conditions:
> > > > >
> > > > > 1) We must accept a delay in when re-authentication occurs to when
> > the
> > > > > connection is used
> > > > > 2) We must accept the bigger latency spikes associated with not
> > > > > interleaving requests
> > > > >
> > > > > Does this sound right to you, and if so, do you find these
> conditions
> > > > > acceptable?  Or have I missed something and/or made incorrect
> > > assumptions
> > > > > somewhere?
> > > > >
> > > > > Ron
> > > > >
> > > > >
> > > > > On Fri, Sep 7, 2018 at 5:19 PM Ron Dagostino <[email protected]>
> > > wrote:
> > > > >
> > > > >> Hi Rajini.  The code really helped me to understand what you were
> > > > >> thinking -- thanks.  I'm still digesting, but here are some quick
> > > > >> observations:
> > > > >>
> > > > >> Your approach (I'll call it the "low-level" approach, as compared
> to
> > > the
> > > > >> existing PR, which works at a higher level of abstraction) -- the
> > > > low-level
> > > > >> approach is certainly intriguing.  The smaller code size is
> welcome,
> > > of
> > > > >> course, as is the fact that re-authentication simply works for
> > > everyone
> > > > >> regardless of the style of use (async vs. sync I/O).
> > > > >>
> > > > >> I did notice that re-authentication of the connection starts only
> > > > if/when
> > > > >> the client uses the connection.  For example, if I run a console
> > > > producer,
> > > > >> re-authentication does not happen unless I try to produce
> something.
> > > On
> > > > >> the one hand this is potentially good -- if the client isn't using
> > the
> > > > >> connection then the connection stays "silent" and could be closed
> on
> > > the
> > > > >> server side if it stays idle long enough -- whereas if we start
> > > > injecting
> > > > >> re-authentication requests (as is done in the high-level approach)
> > > then
> > > > the
> > > > >> connection never goes completely silent and could (?) potentially
> > > avoid
> > > > >> being closed due to idleness.
> > > > >>
> > > > >> However, if we implement sever-side kill of connections using
> > expired
> > > > >> credentials (which we agree is needed), then we might end up with
> > the
> > > > >> broker killing connections that are sitting idle for only a short
> > > > period of
> > > > >> time.  For example, if we refresh the token on the client side and
> > > tell
> > > > the
> > > > >> connection that it is eligible to be re-authenticated, then it is
> > > > >> conceivable that the connection might be sitting idle at that
> point
> > > and
> > > > >> might not be used until after the token it is currently using
> > expires.
> > > > The
> > > > >> server might kill the connection, and that would force the client
> to
> > > > >> re-connect with a new connection (requiring TLS negotiation). The
> > > > >> probability of this happening increases as the token lifetime
> > > > decreases, of
> > > > >> course, and it can be offset by decreasing the window factor (i.e.
> > > make
> > > > it
> > > > >> eligible for re-authenticate at 50% of the lifetime rather than
> 80%,
> > > for
> > > > >> example -- it would have to sit idle for longer before the server
> > > tried
> > > > to
> > > > >> kill it).  We haven't implemented server-side kill yet, so maybe
> we
> > > > could
> > > > >> make it intelligent and only kill the connection if it is used
> (for
> > > > >> anything except re-authentication) after the expiration time...
> > > > >>
> > > > >> I also wonder about the ability to add retry into the low-level
> > > > >> approach.  Do you think it would be possible?  It doesn't seem
> like
> > it
> > > > to
> > > > >> me -- at least not without some big changes that would eliminate
> > some
> > > of
> > > > >> the advantage of being able to reuse the existing authentication
> > code.
> > > > The
> > > > >> reason I ask is because I think retry is necessary.  It is part of
> > how
> > > > >> refreshes work for both GSSAPI and OAUTHBEARER -- they refresh
> based
> > > on
> > > > >> some window factor (i.e. 80%) and implement periodic retry upon
> > > failure
> > > > so
> > > > >> that they can maximize the chances of having a new credential
> > > available
> > > > for
> > > > >> any new connection attempt.  Without refresh we could end up in
> the
> > > > >> situation where the connection still has some lifetime left (10%,
> > 20%,
> > > > or
> > > > >> whatever) but it tries to re-authenticate and cannot through no
> > fault
> > > of
> > > > >> its own (i.e. token endpoint down, some Kerberos failure, etc.)
> -=-
> > > the
> > > > >> connection is closed at that point, and it is then unable to
> > reconnect
> > > > >> because of the same temporary problem.  We could end up with an
> > > > especially
> > > > >> ill-timed, temporary outage in some non-Kafka system (related to
> > OAuth
> > > > or
> > > > >> Kerberos, or some LDAP directory) causing all clients to be kicked
> > off
> > > > the
> > > > >> cluster.  Retry capability seems to me to be the way to mitigate
> > this
> > > > risk.
> > > > >>
> > > > >> Anyway, that's it for now.  I really like the approach you
> outlined
> > --
> > > > at
> > > > >> least at this point based on my current understanding.  I will
> > > continue
> > > > to
> > > > >> dig in, and I may send more comments/questions.  But for now, I
> > think
> > > > the
> > > > >> lack of retry -- and my definitely-could-be-wrong sense that it
> > can't
> > > > >> easily be added -- is my biggest concern with this low-level
> > approach.
> > > > >>
> > > > >> Ron
> > > > >>
> > > > >> On Thu, Sep 6, 2018 at 4:57 PM Rajini Sivaram <
> > > [email protected]>
> > > > >> wrote:
> > > > >>
> > > > >>> Hi Ron,
> > > > >>>
> > > > >>> The commit
> > > > >>>
> > > > >>> https://github.com/rajinisivaram/kafka/commit/
> > > > b9d711907ad843c11d17e80d6743bfb1d4e3f3fd
> > > > >>> shows
> > > > >>> the kind of flow I was thinking of. It is just a prototype with a
> > > fixed
> > > > >>> re-authentication period to explore the possibility of
> implementing
> > > > >>> re-authentication similar to authentication. There will be edge
> > cases
> > > > to
> > > > >>> cover and errors to handle, but hopefully the code makes the
> > approach
> > > > >>> clearer than my earlier explanation!
> > > > >>>
> > > > >>> So the differences in the two implementations as you have already
> > > > >>> mentioned
> > > > >>> earlier.
> > > > >>>
> > > > >>>    1. Re-authentication sequences are not interleaved with Kafka
> > > > >>> requests.
> > > > >>>    As you said, this has a higher impact on latency. IMO, this
> also
> > > > >>> makes it
> > > > >>>    easier to debug, especially with complex protocols like
> Kerberos
> > > > >>> which are
> > > > >>>    notoriously difficult to diagnose.
> > > > >>>    2. Re-authentication failures will not be retried, they will
> be
> > > > >>> treated
> > > > >>>    as fatal errors similar to authentication failures. IMO, since
> > we
> > > > >>> rely on
> > > > >>>    brokers never rejecting valid authentication requests (clients
> > > treat
> > > > >>>    authentication failures as fatal errors), it is ok to fail on
> > > > >>>    re-authentication failure as well.
> > > > >>>    3. Re-authentication is performed on the network thread on
> > brokers
> > > > >>>    similar to authentication (in your implementation, I think it
> > > would
> > > > >>> be on
> > > > >>>    the request thread). IMO, it is better do all authentications
> > > using
> > > > >>> the
> > > > >>>    same thread pool.
> > > > >>>    4. Code complexity: I don't think actual code complexity would
> > be
> > > > very
> > > > >>>    different between the two implementations. But I think there
> is
> > > > value
> > > > >>> in
> > > > >>>    keeping re-authentication code within existing
> network/security
> > > > >>> layers. The
> > > > >>>    number of classes modified will be smaller and the number of
> > > > packages
> > > > >>>    modified even smaller.
> > > > >>>
> > > > >>> Anyway, let me know what you think:
> > > > >>>
> > > > >>>    - Will this approach work for your scenarios?
> > > > >>>    - Do we really need to handle re-authentication differently
> from
> > > > >>>    authentication?
> > > > >>>
> > > > >>>
> > > > >>> On Thu, Sep 6, 2018 at 1:40 PM, Ron Dagostino <[email protected]
> >
> > > > wrote:
> > > > >>>
> > > > >>> > Yeah, regarding ControllerChannelManager, it is one of the
> > > > synchronous
> > > > >>> I/O
> > > > >>> > use cases (along with 2 others: KafkaProducer, via Sender; and
> > > > >>> > ReplicaFetcherBlockingSend, via ReplicaFetcherThread) where the
> > > > >>> assumption
> > > > >>> > is complete ownership of the connection. The PR's approach to
> > > dealing
> > > > >>> with
> > > > >>> > that assumption is to inject the re-authentication requests
> into
> > > the
> > > > >>> > owner's existing flow so that they are simply sent with a
> > callback
> > > > that
> > > > >>> > NetworkClient executes.  The alternative approach, which is
> what
> > I
> > > > >>> think
> > > > >>> > you are investigating, is to allow the connection to be
> > temporarily
> > > > >>> taken
> > > > >>> > offline and having any attempt by ControllerChannelManager (or
> > > other
> > > > >>> > synchronous use case owner) to use the connection while it is
> in
> > > this
> > > > >>> > "offline" state result in some kind of pause until the
> connection
> > > > comes
> > > > >>> > back online.  One issue with this approach might be the length
> of
> > > > time
> > > > >>> that
> > > > >>> > the connection is unavailable; will it be offline for all
> > > > >>> authentication
> > > > >>> > requests and responses (ApiVersionsRequest/Response,
> > > > >>> > SaslHandshakeRequest/Response, and SaslAuthenticateRequest/
> > > > Response)?
> > > > >>> > Note
> > > > >>> > the last one could actually be invoked multiple times, so there
> > > could
> > > > >>> be 4
> > > > >>> > or more round-trips before the authentication "dance" is
> > finished.
> > > > >>> Will
> > > > >>> > the connection be "offline" the entire time, or will it be
> placed
> > > > back
> > > > >>> > "online" in between each request/response pair to allow the
> owner
> > > of
> > > > >>> the
> > > > >>> > connection to use it -- in which case the authentication
> process
> > > > would
> > > > >>> have
> > > > >>> > to wait to get ownership again?  The approach I took
> interleaves
> > > the
> > > > >>> > authentication requests/responses with whatever the owner is
> > doing,
> > > > so
> > > > >>> it
> > > > >>> > is conceivable that use of the connection jumps back and forth
> > > > between
> > > > >>> the
> > > > >>> > two purposes.  Such jumping back and forth minimizes any added
> > > > latency
> > > > >>> due
> > > > >>> > to the re-authentication, of course.
> > > > >>> >
> > > > >>> > Anyway, I'll look forward to finding out what you are able to
> > > > conclude.
> > > > >>> > Good luck :-)
> > > > >>> >
> > > > >>> > Ron
> > > > >>> >
> > > > >>> > On Thu, Sep 6, 2018 at 5:17 AM Rajini Sivaram <
> > > > [email protected]
> > > > >>> >
> > > > >>> > wrote:
> > > > >>> >
> > > > >>> > > Hi Ron,
> > > > >>> > >
> > > > >>> > > Disconnections on the broker-side: I think we should do
> > > > >>> disconnections
> > > > >>> > as a
> > > > >>> > > separate KIP and PR as you originally intended. But that one
> > > could
> > > > be
> > > > >>> > done
> > > > >>> > > separately without requiring KIP-368 as a pre-req. As a
> simpler
> > > > >>> > > implementation and one that can be used without KIP-368 in
> some
> > > > >>> cases, we
> > > > >>> > > could commit that first since this one may take longer.
> > > > >>> > >
> > > > >>> > > I wasn't suggesting that we do move re-authentication to
> > > > >>> NetworkClient, I
> > > > >>> > > was thinking of the lower layers, handling authentication and
> > > > >>> > > reauthentication at the same layer in a similar way. Let me
> > look
> > > > >>> into the
> > > > >>> > > code and come up with a more detailed explanation to avoid
> > > > confusion.
> > > > >>> > >
> > > > >>> > > I am not too concerned about the imports in KafkaChannel. As
> > you
> > > > >>> said, we
> > > > >>> > > can improve that if we need to. KafkaChannels are aware of
> > > > >>> > > network/authentication states and if that becomes a bit more
> > > > >>> complex, I
> > > > >>> > > don't think it would matter so much. My concern is about
> > changes
> > > > like
> > > > >>> > >
> > > > >>> > > https://github.com/apache/kafka/pull/5582/files#diff-
> > > > >>> > 987fef43991384a3ebec5fb55e53b577
> > > > >>> > > in ControllerChannelManager. Those classes shouldn't have
> deal
> > > with
> > > > >>> SASL
> > > > >>> > or
> > > > >>> > > reauthentication. Anyway, I will get back with more detail on
> > > what
> > > > I
> > > > >>> had
> > > > >>> > in
> > > > >>> > > mind since that may not be viable at all.
> > > > >>> > >
> > > > >>> > >
> > > > >>> > >
> > > > >>> > > On Thu, Sep 6, 2018 at 1:44 AM, Ron Dagostino <
> > [email protected]
> > > >
> > > > >>> wrote:
> > > > >>> > >
> > > > >>> > > > I just thought of another alternative if the imports are
> the
> > > > >>> concern.
> > > > >>> > > > KafkaChannel could expose the fact that it can create an
> > > > additional
> > > > >>> > > > Authenticator instance on the side (what I referred to as
> > > > >>> > > > notYetAuthenticatedAuthenticator in the PR) and it could
> let
> > > > >>> > > > kafka.server.KafkaApis drive the whole thing -- create the
> > > > >>> instance on
> > > > >>> > > the
> > > > >>> > > > side, clean it up if it fails, move it into place and close
> > the
> > > > >>> old one
> > > > >>> > > if
> > > > >>> > > > it succeeds, etc.  Then KafkaChannel wouldn't need to
> import
> > > > >>> anything
> > > > >>> > new
> > > > >>> > > > -- it just exposes its Authenticator and the ability to
> > perform
> > > > the
> > > > >>> > swap
> > > > >>> > > > upon success, etc.
> > > > >>> > > >
> > > > >>> > > > Ron
> > > > >>> > > >
> > > > >>> > > > On Wed, Sep 5, 2018 at 5:01 PM Ron Dagostino <
> > > [email protected]>
> > > > >>> > wrote:
> > > > >>> > > >
> > > > >>> > > > > Hi again, Rajini,  I realized a couple of potential
> > concerns
> > > > with
> > > > >>> > using
> > > > >>> > > > > the TransportLayer directly during re-authentication.
> > >  First,
> > > > >>> in the
> > > > >>> > > > > blocking I/O use case, the owner of the NetworkClient
> > > instance
> > > > >>> calls
> > > > >>> > > > > NetworkClientUtils.sendAndReceive() to send requests.
> >  This
> > > > >>> method
> > > > >>> > > > > assumes the caller has exclusive access to the
> > NetworkClient,
> > > > so
> > > > >>> it
> > > > >>> > > does
> > > > >>> > > > > not check to see if the node is ready; it just sends the
> > > > request
> > > > >>> and
> > > > >>> > > > > repeatedly calls poll() until the response arrives.  if
> we
> > > were
> > > > >>> to
> > > > >>> > take
> > > > >>> > > > > the connection temporarily offline that method currently
> > has
> > > no
> > > > >>> > > mechanism
> > > > >>> > > > > for checking for such a state before it sends the
> request;
> > we
> > > > >>> could
> > > > >>> > add
> > > > >>> > > > it,
> > > > >>> > > > > but we would have to put in some kind of sleep loop to
> keep
> > > > >>> checking
> > > > >>> > > for
> > > > >>> > > > it
> > > > >>> > > > > to come back "online" before it could send.  Adding such
> a
> > > > sleep
> > > > >>> loop
> > > > >>> > > > could
> > > > >>> > > > > be done, of course, but it doesn't sound ideal.
> > > > >>> > > > >
> > > > >>> > > > > A similar sleep loop situation exists in the async use
> > case.
> > > > The
> > > > >>> > owner
> > > > >>> > > > > repeatedly calls poll() in a loop, but if the connection
> to
> > > the
> > > > >>> node
> > > > >>> > is
> > > > >>> > > > > temporarily offline then the poll() method would have to
> > > enter
> > > > a
> > > > >>> > sleep
> > > > >>> > > > > loop until either the connection comes back online or the
> > > > timeout
> > > > >>> > > elapses
> > > > >>> > > > > (whichever comes first).
> > > > >>> > > > >
> > > > >>> > > > > I don't know if there is an aversion to adding sleep
> loops
> > > like
> > > > >>> that,
> > > > >>> > > so
> > > > >>> > > > > maybe it isn't a big issue, but I wanted to raise it as a
> > > > >>> potential
> > > > >>> > > > concern
> > > > >>> > > > > with this approach.
> > > > >>> > > > >
> > > > >>> > > > > Also, is the import of SASL-specific classes in
> > KafkaChannel
> > > a
> > > > >>> major
> > > > >>> > > > > objection to the current implementation?  I could
> eliminate
> > > > that
> > > > >>> by
> > > > >>> > > > > replacing the 2 offending methods in KafkaChannel with
> this
> > > one
> > > > >>> and
> > > > >>> > > > > having the implementation delegate to the authenticator:
> > > > >>> > > > >
> > > > >>> > > > >     /**
> > > > >>> > > > >      * Respond to a re-authentication request. This
> occurs
> > on
> > > > the
> > > > >>> > > > >      * Server side of the re-authentication dance (i.e.
> on
> > > the
> > > > >>> > broker).
> > > > >>> > > > >      *
> > > > >>> > > > >      * @param requestHeader
> > > > >>> > > > >      *            the request header
> > > > >>> > > > >      * @param request
> > > > >>> > > > >      *            the request to process
> > > > >>> > > > >      * @return the response to return to the client
> > > > >>> > > > >      */
> > > > >>> > > > >     public AbstractResponse
> respondToReauthenticationReque
> > > > >>> > > > st(RequestHeader
> > > > >>> > > > > requestHeader,
> > > > >>> > > > >             AbstractRequest request)
> > > > >>> > > > >
> > > > >>> > > > > There is practically no work being done in the
> KafkaChannel
> > > > >>> instance
> > > > >>> > > > > anyway -- it does some sanity checking but otherwise
> > > delegates
> > > > >>> to the
> > > > >>> > > > > authenticator.  We could just add a method to the
> > > Authenticator
> > > > >>> > > interface
> > > > >>> > > > > and delegate the whole thing.
> > > > >>> > > > >
> > > > >>> > > > > Ron
> > > > >>> > > > >
> > > > >>> > > > > On Wed, Sep 5, 2018 at 2:07 PM Ron Dagostino <
> > > > [email protected]>
> > > > >>> > > wrote:
> > > > >>> > > > >
> > > > >>> > > > >> Hi Rajini.  I'm now skeptical of my "ConnectionState.
> > > > >>> > REAUTHENTICATING"
> > > > >>> > > > idea.
> > > > >>> > > > >> The concept of a connection being "READY" or not can
> > impact
> > > > >>> > > > >> ConsumerCoordinator (see, for example,
> > > > >>> > > > >> https://github.com/apache/kafka/blob/trunk/clients/src/
> > > > >>> > > > main/java/org/apache/kafka/clients/consumer/internals/
> > > > >>> > > > ConsumerCoordinator.java#L352).
> > > > >>> > > > >> The semantics of a connection being "READY" and the
> > > > >>> > > > >> implications/assumptions are not clear, and I suspect
> > there
> > > > >>> will be
> > > > >>> > > some
> > > > >>> > > > >> unintended consequences of this approach that may not be
> > > > >>> immediately
> > > > >>> > > > >> apparent.
> > > > >>> > > > >>
> > > > >>> > > > >> I will confess that when I was implementing
> > > re-authentication
> > > > I
> > > > >>> > > thought
> > > > >>> > > > >> it might be worthwhile to unify the
> authentication-related
> > > > code
> > > > >>> > bases
> > > > >>> > > --
> > > > >>> > > > >> except I suspected it would be good to go in the other
> > > > >>> direction:
> > > > >>> > have
> > > > >>> > > > the
> > > > >>> > > > >> current code that directly uses the TransportLayer
> instead
> > > use
> > > > >>> > > > >> AuthenticationSuccessOrFailureReceiver and
> > > > >>> > > > AuthenticationRequestEnqueuer.
> > > > >>> > > > >> I'm not advocating that we do it -- I decided to not go
> > > there
> > > > >>> when
> > > > >>> > > > creating
> > > > >>> > > > >> the PR, after all -- but I did get a strong feeling that
> > > > >>> directly
> > > > >>> > > using
> > > > >>> > > > the
> > > > >>> > > > >> TransportLayer as is currently done is really only
> viable
> > > > before
> > > > >>> > > anybody
> > > > >>> > > > >> else starts using the connection.  If we want to use the
> > > > >>> > > TransportLayer
> > > > >>> > > > again
> > > > >>> > > > >> after that point then it is up to us to somehow take the
> > > > >>> connection
> > > > >>> > > > >> "temporarily offline" so that we have exclusive rights
> to
> > it
> > > > >>> again,
> > > > >>> > > and
> > > > >>> > > > I
> > > > >>> > > > >> wonder if the concept of a connection being "temporarily
> > > > >>> offline" is
> > > > >>> > > > >> something the existing code is able to handle --
> probably
> > > not,
> > > > >>> and I
> > > > >>> > > > >> suspect there are unstated assumptions that would be
> > > > >>> invalidated.
> > > > >>> > > > >>
> > > > >>> > > > >> Do you think this particular "ConnectionState.
> > > > REAUTHENTICATING"
> > > > >>> > idea
> > > > >>> > > is
> > > > >>> > > > >> worth pursuing?  How about the general idea of trying to
> > use
> > > > the
> > > > >>> > > > >> TransportLayer directly -- are you still feeling like it
> > is
> > > > >>> viable?
> > > > >>> > > > >>
> > > > >>> > > > >> Ron
> > > > >>> > > > >>
> > > > >>> > > > >>
> > > > >>> > > > >>
> > > > >>> > > > >> Ron
> > > > >>> > > > >>
> > > > >>> > > > >> On Wed, Sep 5, 2018 at 11:40 AM Ron Dagostino <
> > > > >>> [email protected]>
> > > > >>> > > > wrote:
> > > > >>> > > > >>
> > > > >>> > > > >>> <<<in favor of implementing server-side kill in
> addition
> > to
> > > > >>> > > > >>> re-authentication, not as a replacement.
> > > > >>> > > > >>> <<<I think Rajini suggested the same thing
> > > > >>> > > > >>> Oh, ok, I misunderstood.  Then I think we are on the
> same
> > > > >>> page: if
> > > > >>> > we
> > > > >>> > > > >>> are going to do re-authentication, then we should also
> do
> > > > >>> > > > >>> server-side-kill-upon-expiration as part of the same
> > > > >>> > implementation.
> > > > >>> > > > I'm
> > > > >>> > > > >>> good with that.
> > > > >>> > > > >>>
> > > > >>> > > > >>> I am also looking into Rajini's idea of doing
> > > > >>> re-authentication at
> > > > >>> > > the
> > > > >>> > > > >>> NetworkClient level and reusing the existing
> > authentication
> > > > >>> code
> > > > >>> > > > path.  I
> > > > >>> > > > >>> was skeptical when she suggested it, but now that I
> look
> > > > >>> closer I
> > > > >>> > see
> > > > >>> > > > >>> something that I can try.  NetworkClient has logic to
> > > > >>> recognize the
> > > > >>> > > > >>> state "ConnectionState.CONNECTING" as meaning "you
> can't
> > do
> > > > >>> > anything
> > > > >>> > > > >>> with this connection at the moment; please wait."  I'm
> > > going
> > > > >>> to try
> > > > >>> > > > adding
> > > > >>> > > > >>> a new state "ConnectionState.REAUTHENTICATING" that
> would
> > > be
> > > > >>> > > > recognized
> > > > >>> > > > >>> in a similar way.  Then the challenge becomes inserting
> > > > myself
> > > > >>> into
> > > > >>> > > any
> > > > >>> > > > >>> existing flow that might be going on.  I'll probably
> add
> > > the
> > > > >>> > request
> > > > >>> > > > to set
> > > > >>> > > > >>> the state to "REAUTHENTICATING" to a queue if I can't
> > grab
> > > > the
> > > > >>> > state
> > > > >>> > > > >>> immediately and have the network client's poll() method
> > > check
> > > > >>> at
> > > > >>> > the
> > > > >>> > > > >>> end to see if any such requests can be granted; there
> > would
> > > > be
> > > > >>> a
> > > > >>> > > > callback
> > > > >>> > > > >>> associated with the request, and that way I can be
> > assured
> > > I
> > > > >>> would
> > > > >>> > be
> > > > >>> > > > >>> granted the request in a reasonable amount of time
> > > (assuming
> > > > >>> the
> > > > >>> > > > connection
> > > > >>> > > > >>> doesn't close in the meantime).  Then it would be up
> the
> > > > >>> callback
> > > > >>> > > > >>> implementation to perform the re-authentication dance
> and
> > > set
> > > > >>> the
> > > > >>> > > state
> > > > >>> > > > >>> back to "ConnectionState.READY".  I don't know if this
> > will
> > > > >>> work,
> > > > >>> > and
> > > > >>> > > > >>> I'm probably missing some subtleties at the moment, but
> > > I'll
> > > > >>> give
> > > > >>> > it
> > > > >>> > > a
> > > > >>> > > > shot
> > > > >>> > > > >>> and see what happens.
> > > > >>> > > > >>>
> > > > >>> > > > >>> Ron
> > > > >>> > > > >>>
> > > > >>> > > > >>> On Wed, Sep 5, 2018 at 11:23 AM Colin McCabe <
> > > > >>> [email protected]>
> > > > >>> > > > wrote:
> > > > >>> > > > >>>
> > > > >>> > > > >>>> On Wed, Sep 5, 2018, at 07:34, Ron Dagostino wrote:
> > > > >>> > > > >>>> > I added a "How To Support Re-Authentication for
> Other
> > > SASL
> > > > >>> > > > Mechanisms"
> > > > >>> > > > >>>> > section to the KIP as Rajini suggested.  I also
> added
> > a
> > > > >>> > "Rejected
> > > > >>> > > > >>>> > Alternative" for the idea of forcibly closing
> > > connections
> > > > >>> on the
> > > > >>> > > > >>>> client
> > > > >>> > > > >>>> > side upon token refresh or on the server side upon
> > token
> > > > >>> > > expiration.
> > > > >>> > > > >>>> It
> > > > >>> > > > >>>> > may be a bit premature to reject the server-side
> kill
> > > > >>> scenario
> > > > >>> > > given
> > > > >>> > > > >>>> that
> > > > >>> > > > >>>> > Colin and Rajini are partial to it, but see below
> for
> > > > what I
> > > > >>> > said
> > > > >>> > > > >>>> about it,
> > > > >>> > > > >>>> > and I think it makes sense -- server-side kill
> without
> > > an
> > > > >>> > ability
> > > > >>> > > > for
> > > > >>> > > > >>>> the
> > > > >>> > > > >>>> > client to re-authenticate to avoid it may be useful
> in
> > > > >>> certain
> > > > >>> > > > >>>> specific
> > > > >>> > > > >>>> > cases, but as a general feature it doesn't really
> > > work.  I
> > > > >>> would
> > > > >>> > > be
> > > > >>> > > > >>>> willing
> > > > >>> > > > >>>> > to add server-side-kill to the scope of this KIP if
> > that
> > > > is
> > > > >>> > > desired.
> > > > >>> > > > >>>>
> > > > >>> > > > >>>> Hi Ron,
> > > > >>> > > > >>>>
> > > > >>> > > > >>>> To clarify, I am in favor of implementing server-side
> > kill
> > > > in
> > > > >>> > > addition
> > > > >>> > > > >>>> to re-authentication, not as a replacement.  I think
> > > Rajini
> > > > >>> > > suggested
> > > > >>> > > > the
> > > > >>> > > > >>>> same thing.
> > > > >>> > > > >>>>
> > > > >>> > > > >>>> It seems clear that server-side kill is needed to
> > provide
> > > > >>> > security.
> > > > >>> > > > >>>> Otherwise a bad client can simply decide not to
> > > > >>> re-authenticate,
> > > > >>> > and
> > > > >>> > > > >>>> continue using server resources indefinitely.  Neither
> > > > >>> > > authentication
> > > > >>> > > > nor
> > > > >>> > > > >>>> re-authentication should be optional, or else the bad
> > guys
> > > > >>> will
> > > > >>> > > > simply take
> > > > >>> > > > >>>> the option not to authenticate.
> > > > >>> > > > >>>>
> > > > >>> > > > >>>> best,
> > > > >>> > > > >>>> Colin
> > > > >>> > > > >>>>
> > > > >>> > > > >>>>
> > > > >>> > > > >>>> >
> > > > >>> > > > >>>> > A brute-force alternative is to simply kill the
> > > connection
> > > > >>> on
> > > > >>> > the
> > > > >>> > > > >>>> client
> > > > >>> > > > >>>> > > side when the background login thread refreshes
> the
> > > > >>> > credential.
> > > > >>> > > > The
> > > > >>> > > > >>>> > > advantage is that we don't need a code path for
> > > > >>> > > re-authentication
> > > > >>> > > > –
> > > > >>> > > > >>>> the
> > > > >>> > > > >>>> > > client simply connects again to replace the
> > connection
> > > > >>> that
> > > > >>> > was
> > > > >>> > > > >>>> killed.
> > > > >>> > > > >>>> > > There are many disadvantages, though.  The
> approach
> > is
> > > > >>> harsh –
> > > > >>> > > > >>>> having
> > > > >>> > > > >>>> > > connections pulled out from underneath the client
> > will
> > > > >>> > introduce
> > > > >>> > > > >>>> latency
> > > > >>> > > > >>>> > > while the client reconnects; it introduces
> > non-trivial
> > > > >>> > resource
> > > > >>> > > > >>>> utilization
> > > > >>> > > > >>>> > > on both the client and server as TLS is
> > renegotiated;
> > > > and
> > > > >>> it
> > > > >>> > > > forces
> > > > >>> > > > >>>> the
> > > > >>> > > > >>>> > > client to periodically "recover" from what
> > essentially
> > > > >>> looks
> > > > >>> > > like
> > > > >>> > > > a
> > > > >>> > > > >>>> failure
> > > > >>> > > > >>>> > > scenario.  While these are significant
> > disadvantages,
> > > > the
> > > > >>> most
> > > > >>> > > > >>>> significant
> > > > >>> > > > >>>> > > disadvantage of all is that killing connections on
> > the
> > > > >>> client
> > > > >>> > > side
> > > > >>> > > > >>>> adds no
> > > > >>> > > > >>>> > > security – trusting the client to kill its
> > connection
> > > > in a
> > > > >>> > > timely
> > > > >>> > > > >>>> fashion
> > > > >>> > > > >>>> > > is a blind and unjustifiable trust.
> > > > >>> > > > >>>> > >
> > > > >>> > > > >>>> >
> > > > >>> > > > >>>> >
> > > > >>> > > > >>>> > > We could kill the connection from the server side
> > > > instead,
> > > > >>> > when
> > > > >>> > > > the
> > > > >>> > > > >>>> token
> > > > >>> > > > >>>> > > expires.  But in this case, if there is no ability
> > for
> > > > the
> > > > >>> > > client
> > > > >>> > > > to
> > > > >>> > > > >>>> > > re-authenticate to avoid the killing of the
> > connection
> > > > in
> > > > >>> the
> > > > >>> > > > first
> > > > >>> > > > >>>> place,
> > > > >>> > > > >>>> > > then we still have all of the harsh approach
> > > > disadvantages
> > > > >>> > > > >>>> mentioned above.
> > > > >>> > > > >>>> >
> > > > >>> > > > >>>> >
> > > > >>> > > > >>>> > Ron
> > > > >>> > > > >>>> >
> > > > >>> > > > >>>> > On Wed, Sep 5, 2018 at 10:25 AM Colin McCabe <
> > > > >>> > [email protected]>
> > > > >>> > > > >>>> wrote:
> > > > >>> > > > >>>> >
> > > > >>> > > > >>>> > > On Wed, Sep 5, 2018, at 01:41, Rajini Sivaram
> wrote:
> > > > >>> > > > >>>> > > > *Re-authentication vs disconnection:*
> > > > >>> > > > >>>> > > > In a vast number of secure Kafka deployments,
> > > SASL_SSL
> > > > >>> is
> > > > >>> > the
> > > > >>> > > > >>>> security
> > > > >>> > > > >>>> > > > protocol (this is the recommended config for
> > > > >>> OAUTHBEARER).
> > > > >>> > If
> > > > >>> > > we
> > > > >>> > > > >>>> require
> > > > >>> > > > >>>> > > > disconnections on token expiry, we would need
> new
> > > > >>> > connections
> > > > >>> > > to
> > > > >>> > > > >>>> be
> > > > >>> > > > >>>> > > > established with an expensive SSL handshake.
> This
> > > adds
> > > > >>> load
> > > > >>> > on
> > > > >>> > > > >>>> the broker
> > > > >>> > > > >>>> > > > and will result in a latency spike. For
> > OAUTHBEARER
> > > in
> > > > >>> > > > >>>> particular, when
> > > > >>> > > > >>>> > > > tokens are used to make authorisation decisions,
> > we
> > > > >>> want to
> > > > >>> > > be a
> > > > >>> > > > >>>> able to
> > > > >>> > > > >>>> > > > support short-lived tokens where token lifetime
> > > > >>> (granting
> > > > >>> > > > >>>> authorisation)
> > > > >>> > > > >>>> > > is
> > > > >>> > > > >>>> > > > small. To make this usable in practice, I
> believe
> > we
> > > > >>> need to
> > > > >>> > > > >>>> support
> > > > >>> > > > >>>> > > > re-authentication of existing connections.
> > > > >>> > > > >>>> > >
> > > > >>> > > > >>>> > > Hi Rajini,
> > > > >>> > > > >>>> > >
> > > > >>> > > > >>>> > > Thanks for the explanation.  That makes sense.
> > > > >>> > > > >>>> > >
> > > > >>> > > > >>>> > > >
> > > > >>> > > > >>>> > > > *Also explicitly out-of-scope for this proposal
> is
> > > the
> > > > >>> > ability
> > > > >>> > > > for
> > > > >>> > > > >>>> > > brokers
> > > > >>> > > > >>>> > > > to close connections that continue to use
> expired
> > > > >>> > credentials.
> > > > >>> > > > >>>> This
> > > > >>> > > > >>>> > > > ability is a natural next step, but it will be
> > > > addressed
> > > > >>> > via a
> > > > >>> > > > >>>> separate
> > > > >>> > > > >>>> > > KIP
> > > > >>> > > > >>>> > > > if/when this one is adopted.*
> > > > >>> > > > >>>> > > > Perhaps we could do this the other way round? I
> > > don't
> > > > >>> think
> > > > >>> > we
> > > > >>> > > > >>>> would ever
> > > > >>> > > > >>>> > > > want to close connections on the client-side to
> > > > support
> > > > >>> > > expired
> > > > >>> > > > >>>> > > credentials
> > > > >>> > > > >>>> > > > because that doesn't add any security
> guarantees.
> > > But
> > > > >>> we do
> > > > >>> > > > >>>> require the
> > > > >>> > > > >>>> > > > ability for brokers to close connections in
> order
> > to
> > > > >>> enforce
> > > > >>> > > > >>>> credential
> > > > >>> > > > >>>> > > > expiry. Disconnection on the broker-side may be
> > > > >>> sufficient
> > > > >>> > for
> > > > >>> > > > >>>> some
> > > > >>> > > > >>>> > > > deployments and could be useful on its own. It
> > would
> > > > >>> also be
> > > > >>> > > the
> > > > >>> > > > >>>> easier
> > > > >>> > > > >>>> > > > implementation. So maybe that could be the first
> > > step?
> > > > >>> > > > >>>> > >
> > > > >>> > > > >>>> > > +1 for doing disconnection first.  Otherwise, as
> you
> > > > >>> noted,
> > > > >>> > > there
> > > > >>> > > > >>>> are no
> > > > >>> > > > >>>> > > security guarantees -- the client can just decide
> > not
> > > to
> > > > >>> > > > >>>> re-authenticate
> > > > >>> > > > >>>> > > and keep using the old credentials.  You don't
> even
> > > need
> > > > >>> to
> > > > >>> > > modify
> > > > >>> > > > >>>> the
> > > > >>> > > > >>>> > > source code -- older clients would behave this
> way.
> > > > >>> > > > >>>> > >
> > > > >>> > > > >>>> > > best,
> > > > >>> > > > >>>> > > Colin
> > > > >>> > > > >>>> > >
> > > > >>> > > > >>>> > > >
> > > > >>> > > > >>>> > > > *The implementation is designed in such a way
> that
> > > it
> > > > >>> does
> > > > >>> > not
> > > > >>> > > > >>>> preclude
> > > > >>> > > > >>>> > > > adding support for re-authentication of other
> SASL
> > > > >>> mechanism
> > > > >>> > > > >>>> (e.g. PLAIN,
> > > > >>> > > > >>>> > > > SCRAM-related, and GSSAPI), but doing so is
> > > explicitly
> > > > >>> > > > >>>> out-of-scope for
> > > > >>> > > > >>>> > > > this proposal. *
> > > > >>> > > > >>>> > > > Isn't re-authentication driven by
> > > ExpiringCredential?
> > > > We
> > > > >>> > don't
> > > > >>> > > > >>>> need to
> > > > >>> > > > >>>> > > > support re-authentication by default for the
> other
> > > > >>> > mechanisms
> > > > >>> > > in
> > > > >>> > > > >>>> this
> > > > >>> > > > >>>> > > KIP,
> > > > >>> > > > >>>> > > > but any mechanism could enable this by adding a
> > > custom
> > > > >>> login
> > > > >>> > > > >>>> callback
> > > > >>> > > > >>>> > > > handler to provide an ExpiringCredential? For
> > > > >>> disconnection
> > > > >>> > as
> > > > >>> > > > >>>> well as
> > > > >>> > > > >>>> > > > re-authentication, it will be good if we can
> > specify
> > > > >>> exactly
> > > > >>> > > how
> > > > >>> > > > >>>> it can
> > > > >>> > > > >>>> > > be
> > > > >>> > > > >>>> > > > implemented for each of the SASL mechanisms,
> even
> > if
> > > > we
> > > > >>> > > actually
> > > > >>> > > > >>>> > > implement
> > > > >>> > > > >>>> > > > it only for OAUTHBEARER.
> > > > >>> > > > >>>> > > >
> > > > >>> > > > >>>> > > >
> > > > >>> > > > >>>> > > > On Wed, Sep 5, 2018 at 2:43 AM, Colin McCabe <
> > > > >>> > > > [email protected]>
> > > > >>> > > > >>>> wrote:
> > > > >>> > > > >>>> > > >
> > > > >>> > > > >>>> > > > > On Tue, Sep 4, 2018, at 17:43, Ron Dagostino
> > > wrote:
> > > > >>> > > > >>>> > > > > > Hi Colin.  Different organizations will rely
> > on
> > > > >>> > different
> > > > >>> > > > >>>> token
> > > > >>> > > > >>>> > > > > lifetimes,
> > > > >>> > > > >>>> > > > > > but anything shorter than an hour feels like
> > it
> > > > >>> would be
> > > > >>> > > > >>>> pretty
> > > > >>> > > > >>>> > > > > > aggressive.  An hour or more will probably
> be
> > > most
> > > > >>> > common.
> > > > >>> > > > >>>> > > > >
> > > > >>> > > > >>>> > > > > Thanks.  That's helpful to give me a sense of
> > what
> > > > the
> > > > >>> > > > >>>> performance
> > > > >>> > > > >>>> > > impact
> > > > >>> > > > >>>> > > > > might be.
> > > > >>> > > > >>>> > > > >
> > > > >>> > > > >>>> > > > > >
> > > > >>> > > > >>>> > > > > > <<<alternate solution of terminating
> > connections
> > > > >>> when
> > > > >>> > the
> > > > >>> > > > >>>> bearer
> > > > >>> > > > >>>> > > token
> > > > >>> > > > >>>> > > > > > changed
> > > > >>> > > > >>>> > > > > > I may be mistaken, but I think you are
> > > suggesting
> > > > >>> here
> > > > >>> > > that
> > > > >>> > > > we
> > > > >>> > > > >>>> > > forcibly
> > > > >>> > > > >>>> > > > > > kill connections from the client side
> whenever
> > > the
> > > > >>> > > > background
> > > > >>> > > > >>>> Login
> > > > >>> > > > >>>> > > > > refresh
> > > > >>> > > > >>>> > > > > > thread refreshes the token (which it
> currently
> > > > does
> > > > >>> so
> > > > >>> > > that
> > > > >>> > > > >>>> the
> > > > >>> > > > >>>> > > client
> > > > >>> > > > >>>> > > > > can
> > > > >>> > > > >>>> > > > > > continue to make new connections).  Am I
> > > correct?
> > > > >>> > > > >>>> > > > >
> > > > >>> > > > >>>> > > > > Yes, this is what I'm thinking about.  We
> could
> > > also
> > > > >>> > > terminate
> > > > >>> > > > >>>> the
> > > > >>> > > > >>>> > > > > connection on the server, if that is more
> > > > convenient.
> > > > >>> > > > >>>> > > > >
> > > > >>> > > > >>>> > > > > >  If that is what you are
> > > > >>> > > > >>>> > > > > > referring to, my sense is that it would be a
> > > very
> > > > >>> crude
> > > > >>> > > way
> > > > >>> > > > of
> > > > >>> > > > >>>> > > dealing
> > > > >>> > > > >>>> > > > > with
> > > > >>> > > > >>>> > > > > > the issue that would probably lead to
> > > > >>> dissatisfaction in
> > > > >>> > > > some
> > > > >>> > > > >>>> sense
> > > > >>> > > > >>>> > > > > (though
> > > > >>> > > > >>>> > > > > > I can't be sure).
> > > > >>> > > > >>>> > > > >
> > > > >>> > > > >>>> > > > > What information should we gather so that we
> can
> > > be
> > > > >>> sure?
> > > > >>> > > > >>>> > > > >
> > > > >>> > > > >>>> > > > > >  I do know that when I implemented
> > > > SASL/OAUTHBEARER
> > > > >>> it
> > > > >>> > > > >>>> > > > > > was communicated that leaving existing
> > > connections
> > > > >>> > intact
> > > > >>> > > --
> > > > >>> > > > >>>> as is
> > > > >>> > > > >>>> > > done
> > > > >>> > > > >>>> > > > > for
> > > > >>> > > > >>>> > > > > > GSSAPI -- was the appropriate path forward.
> > > > >>> > > > >>>> > > > >
> > > > >>> > > > >>>> > > > > Thanks, that's good background information.
> Can
> > > > >>> someone
> > > > >>> > > chime
> > > > >>> > > > >>>> in with
> > > > >>> > > > >>>> > > the
> > > > >>> > > > >>>> > > > > reasoning behind this?
> > > > >>> > > > >>>> > > > >
> > > > >>> > > > >>>> > > > > My best guess is that terminating connections
> > > might
> > > > >>> cause
> > > > >>> > a
> > > > >>> > > > >>>> temporary
> > > > >>> > > > >>>> > > > > increase in latency as they are
> re-established.
> > > > >>> > > > >>>> > > > >
> > > > >>> > > > >>>> > > > > In any case, we should figure out what the
> > > reasoning
> > > > >>> is so
> > > > >>> > > > that
> > > > >>> > > > >>>> we can
> > > > >>> > > > >>>> > > > > make a decision.  It seems worthwhile
> including
> > > this
> > > > >>> as a
> > > > >>> > > > >>>> "rejected
> > > > >>> > > > >>>> > > > > alternative," at least.
> > > > >>> > > > >>>> > > > >
> > > > >>> > > > >>>> > > > > thanks,
> > > > >>> > > > >>>> > > > > Colin
> > > > >>> > > > >>>> > > > >
> > > > >>> > > > >>>> > > > >
> > > > >>> > > > >>>> > > > > >
> > > > >>> > > > >>>> > > > > > Ron
> > > > >>> > > > >>>> > > > > >
> > > > >>> > > > >>>> > > > > > On Tue, Sep 4, 2018 at 8:31 PM Colin McCabe
> <
> > > > >>> > > > >>>> [email protected]>
> > > > >>> > > > >>>> > > wrote:
> > > > >>> > > > >>>> > > > > >
> > > > >>> > > > >>>> > > > > > > Hi Ron,
> > > > >>> > > > >>>> > > > > > >
> > > > >>> > > > >>>> > > > > > > Thanks for the KIP.
> > > > >>> > > > >>>> > > > > > >
> > > > >>> > > > >>>> > > > > > > What is the frequency at which you
> envision
> > > > bearer
> > > > >>> > > tokens
> > > > >>> > > > >>>> changing
> > > > >>> > > > >>>> > > at?
> > > > >>> > > > >>>> > > > > > >
> > > > >>> > > > >>>> > > > > > > Did you consider the alternate solution of
> > > > >>> terminating
> > > > >>> > > > >>>> connections
> > > > >>> > > > >>>> > > when
> > > > >>> > > > >>>> > > > > > > the bearer token changed?
> > > > >>> > > > >>>> > > > > > >
> > > > >>> > > > >>>> > > > > > > best,
> > > > >>> > > > >>>> > > > > > > Colin
> > > > >>> > > > >>>> > > > > > >
> > > > >>> > > > >>>> > > > > > >
> > > > >>> > > > >>>> > > > > > > On Tue, Aug 28, 2018, at 07:28, Ron
> > Dagostino
> > > > >>> wrote:
> > > > >>> > > > >>>> > > > > > > > Hi everyone. I created KIP 368: Allow
> SASL
> > > > >>> > Connections
> > > > >>> > > > to
> > > > >>> > > > >>>> > > > > Periodically
> > > > >>> > > > >>>> > > > > > > > Re-Authenticate
> > > > >>> > > > >>>> > > > > > > > <
> > > > >>> > > > >>>> > > > > > > https://cwiki.apache.org/
> > > > >>> > confluence/display/KAFKA/KIP-
> > > > >>> > > > >>>> > > > >
> > > > >>> > > 368%3A+Allow+SASL+Connections+to+Periodically+Re-Authenticat
> e
> > > > >>> > > > >>>> > > > > > > >
> > > > >>> > > > >>>> > > > > > > > (
> > > > >>> > > > >>>> > > > > > > >
> > > > >>> > > > >>>> > > > > > > https://cwiki.apache.org/
> > > > >>> > confluence/display/KAFKA/KIP-
> > > > >>> > > > >>>> > > > >
> > > > >>> > > 368%3A+Allow+SASL+Connections+to+Periodically+Re-Authenticat
> e
> > > > >>> > > > >>>> > > > > > > ).
> > > > >>> > > > >>>> > > > > > > > The motivation for this KIP is as
> follows:
> > > > >>> > > > >>>> > > > > > > >
> > > > >>> > > > >>>> > > > > > > > The adoption of KIP-255: OAuth
> > > Authentication
> > > > >>> via
> > > > >>> > > > >>>> > > SASL/OAUTHBEARER
> > > > >>> > > > >>>> > > > > > > > <
> > > > >>> > > > >>>> > > > > > >
> > > > >>> https://cwiki.apache.org/confluence/pages/viewpage.
> > > > >>> > > > >>>> > > > > action?pageId=75968876>
> > > > >>> > > > >>>> > > > > > >
> > > > >>> > > > >>>> > > > > > > > in
> > > > >>> > > > >>>> > > > > > > > release 2.0.0 creates the possibility of
> > > using
> > > > >>> > > > >>>> information in the
> > > > >>> > > > >>>> > > > > bearer
> > > > >>> > > > >>>> > > > > > > > token to make authorization decisions.
> > > > >>> > Unfortunately,
> > > > >>> > > > >>>> however,
> > > > >>> > > > >>>> > > Kafka
> > > > >>> > > > >>>> > > > > > > > connections are long-lived, so there is
> no
> > > > >>> ability
> > > > >>> > to
> > > > >>> > > > >>>> change the
> > > > >>> > > > >>>> > > > > bearer
> > > > >>> > > > >>>> > > > > > > > token associated with a particular
> > > connection.
> > > > >>> > > Allowing
> > > > >>> > > > >>>> SASL
> > > > >>> > > > >>>> > > > > > > > connections
> > > > >>> > > > >>>> > > > > > > > to periodically re-authenticate would
> > > resolve
> > > > >>> this.
> > > > >>> > > In
> > > > >>> > > > >>>> addition
> > > > >>> > > > >>>> > > to
> > > > >>> > > > >>>> > > > > this
> > > > >>> > > > >>>> > > > > > > > motivation there are two others that are
> > > > >>> > > > security-related.
> > > > >>> > > > >>>> > > First, to
> > > > >>> > > > >>>> > > > > > > > eliminate access to Kafka for connected
> > > > >>> clients, the
> > > > >>> > > > >>>> current
> > > > >>> > > > >>>> > > > > requirement
> > > > >>> > > > >>>> > > > > > > > is
> > > > >>> > > > >>>> > > > > > > > to remove all authorizations (i.e.
> remove
> > > all
> > > > >>> ACLs).
> > > > >>> > > > >>>> This is
> > > > >>> > > > >>>> > > > > necessary
> > > > >>> > > > >>>> > > > > > > > because of the long-lived nature of the
> > > > >>> connections.
> > > > >>> > > It
> > > > >>> > > > >>>> is
> > > > >>> > > > >>>> > > > > > > > operationally
> > > > >>> > > > >>>> > > > > > > > simpler to shut off access at the point
> of
> > > > >>> > > > >>>> authentication, and
> > > > >>> > > > >>>> > > with
> > > > >>> > > > >>>> > > > > the
> > > > >>> > > > >>>> > > > > > > > release of KIP-86: Configurable SASL
> > > Callback
> > > > >>> > Handlers
> > > > >>> > > > >>>> > > > > > > > <
> > > > >>> > > > >>>> > > > > > > https://cwiki.apache.org/
> > > > >>> > confluence/display/KAFKA/KIP-
> > > > >>> > > > >>>> > > > > 86%3A+Configurable+SASL+callback+handlers
> > > > >>> > > > >>>> > > > > > > >
> > > > >>> > > > >>>> > > > > > > > it
> > > > >>> > > > >>>> > > > > > > > is going to become more and more likely
> > that
> > > > >>> > > > >>>> installations will
> > > > >>> > > > >>>> > > > > > > > authenticate users against external
> > > > directories
> > > > >>> > (e.g.
> > > > >>> > > > via
> > > > >>> > > > >>>> > > LDAP).  The
> > > > >>> > > > >>>> > > > > > > > ability to stop Kafka access by simply
> > > > >>> disabling an
> > > > >>> > > > >>>> account in an
> > > > >>> > > > >>>> > > > > LDAP
> > > > >>> > > > >>>> > > > > > > > directory (for example) is desirable.
> The
> > > > >>> second
> > > > >>> > > > >>>> motivating
> > > > >>> > > > >>>> > > factor
> > > > >>> > > > >>>> > > > > for
> > > > >>> > > > >>>> > > > > > > > re-authentication related to security is
> > > that
> > > > >>> the
> > > > >>> > use
> > > > >>> > > of
> > > > >>> > > > >>>> > > short-lived
> > > > >>> > > > >>>> > > > > > > > tokens
> > > > >>> > > > >>>> > > > > > > > is a common OAuth security
> recommendation,
> > > but
> > > > >>> > > issuing a
> > > > >>> > > > >>>> > > short-lived
> > > > >>> > > > >>>> > > > > > > > token
> > > > >>> > > > >>>> > > > > > > > to a Kafka client (or a broker when
> > > > OAUTHBEARER
> > > > >>> is
> > > > >>> > the
> > > > >>> > > > >>>> > > inter-broker
> > > > >>> > > > >>>> > > > > > > > protocol) currently has no benefit
> because
> > > > once
> > > > >>> a
> > > > >>> > > client
> > > > >>> > > > >>>> is
> > > > >>> > > > >>>> > > > > connected to
> > > > >>> > > > >>>> > > > > > > > a
> > > > >>> > > > >>>> > > > > > > > broker the client is never challenged
> > again
> > > > and
> > > > >>> the
> > > > >>> > > > >>>> connection
> > > > >>> > > > >>>> > > may
> > > > >>> > > > >>>> > > > > > > > remain
> > > > >>> > > > >>>> > > > > > > > intact beyond the token expiration time
> > (and
> > > > may
> > > > >>> > > remain
> > > > >>> > > > >>>> intact
> > > > >>> > > > >>>> > > > > > > > indefinitely
> > > > >>> > > > >>>> > > > > > > > under perfect circumstances).  This KIP
> > > > proposes
> > > > >>> > > adding
> > > > >>> > > > >>>> the
> > > > >>> > > > >>>> > > ability
> > > > >>> > > > >>>> > > > > for
> > > > >>> > > > >>>> > > > > > > > clients (and brokers when OAUTHBEARER is
> > the
> > > > >>> > > > inter-broker
> > > > >>> > > > >>>> > > protocol)
> > > > >>> > > > >>>> > > > > to
> > > > >>> > > > >>>> > > > > > > > re-authenticate their connections to
> > brokers
> > > > and
> > > > >>> > have
> > > > >>> > > > the
> > > > >>> > > > >>>> new
> > > > >>> > > > >>>> > > bearer
> > > > >>> > > > >>>> > > > > > > > token
> > > > >>> > > > >>>> > > > > > > > appear on their session rather than the
> > old
> > > > one.
> > > > >>> > > > >>>> > > > > > > >
> > > > >>> > > > >>>> > > > > > > > The description of this KIP is actually
> > > quite
> > > > >>> > > > >>>> straightforward
> > > > >>> > > > >>>> > > from a
> > > > >>> > > > >>>> > > > > > > > functionality perspective; from an
> > > > >>> implementation
> > > > >>> > > > >>>> perspective,
> > > > >>> > > > >>>> > > > > though,
> > > > >>> > > > >>>> > > > > > > the
> > > > >>> > > > >>>> > > > > > > > KIP is not so straightforward, so it
> > > includes
> > > > a
> > > > >>> pull
> > > > >>> > > > >>>> request
> > > > >>> > > > >>>> > > with a
> > > > >>> > > > >>>> > > > > > > > proposed implementation.  See
> > > > >>> > > > https://github.com/apache/
> > > > >>> > > > >>>> > > > > kafka/pull/5582.
> > > > >>> > > > >>>> > > > > > > >
> > > > >>> > > > >>>> > > > > > > > Ron
> > > > >>> > > > >>>> > > > > > >
> > > > >>> > > > >>>> > > > >
> > > > >>> > > > >>>> > >
> > > > >>> > > > >>>>
> > > > >>> > > > >>>
> > > > >>> > > >
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > > >>
> > > >
> > >
> >
>
>
> --
> Best,
> Stanislav
>

Re: [DISCUSS] KIP 368: Allow SASL Connections to Periodically Re-Authenticate

Reply via email to