Re: [DISCUSS] KIP 368: Allow SASL Connections to Periodically Re-Authenticate

Rajini Sivaram Mon, 17 Sep 2018 11:10:29 -0700

Hi Ron,


*1) Is there a reason to communicate this value back to the client when the
client already knows it?  It's an extra network round-trip, and since the
SASL round-tripsare defined by the spec I'm not certain adding an extra
round-trip is acceptable.*

I wasn't suggesting an extra round-trip for propagating session lifetime. I
was expecting session lifetime to be added to the last SASL_AUTHENTICATE
response from the broker. Because SASL is a challenge-response mechanism,
SaslServer knows when the response being sent is the last one and hence can
send the session lifetime in the response (in the same way as we propagate
errors). I was expecting this to be added as an extra field alongside
errors, not in the opaque body as mentioned in the KIP. The opaque byte
array strictly conforms to the SASL mechanism wire protocol and we want to
keep it that way.

As you have said, we don't need server to propagate session lifetime for
OAUTHBEARER since client knows the token lifetime. But server also knows
the credential lifetime and by having the server decide the lifetime, we
can use the same code path for all mechanisms. If we only have a constant
max lifetime in the server, for PLAIN and SCRAM we will end up having the
same lifetime for all credentials with no ability to set actual expiry. We
use SCRAM for delegation token based authentication where credentials have
an expiry time, so we do need to be able to set individual credential
lifetimes, where the server knows the expiry time but client may not.

*2) I also just realized that if the client is to learn the credential
lifetime we wouldn't want to put special-case code in the Authenticator for
GSSAPI and OAUTHBEARER; we would want to expose the value generically,
probably as a negotiated property on the SaslClient instance.*

I was trying to avoid this altogether. Client doesn't need to know
credential lifetime. Server asks the client to re-authenticate within its
session lifetime.

3) From the KIP, I wasn't entirely clear about the purpose of the two
configs:

sasl.login.refresh.reauthenticate.enable: Do we need this? Client knows if
broker version supports re-authentication based on the SASL_AUTHENTICATE
version returned in ApiVersionsResponse. Client knows if broker is
configured to enable re-authentication based on session lifetime returned
in SaslAuthenticateResponse. If broker has re-authentication configured and
client supports re-authentication, you would always want re-authenticate.
So I wasn't sure why we need a config to opt in or out on the client-side.


connections.max.reauth.ms: We obviously need a broker-side config. Not
entirely sure about the semantics of the config that drives
re-authentication. In particular, I wasn't expecting that we would reject
tokens or tickets simply because they were too long-lived. Since tokens or
tickets are granted by a 3rd party authority, I am not sure if clients will
always have control over the lifetime. Do we need to support any more
scenarios than these:


A) reauth.ms=10,credential.lifetime.ms=10 : Broker sets session.lifetime=10,
so this works.

B) reauth.ms=10, credential.lifetime.ms=5 : Broker sets session.lifetime=5,
so this works.
C) reauth.ms=10, credential.lifetime.ms=20 : Broker sets
session.lifetime=10. Client re-authenticates even though token was not
refreshed. Does this matter?
D) reauth.ms=Long.MAX_VALUE, credential.lifetime.ms=10: Broker sets
session.lifetime=10, client re-authenticates based on credential expiry.
E) reauth.ms=0 (default), credential.lifetime.ms=10 : Broker sets
session.lifetime=0, Broker doesn't terminate sessions, client doesn't
re-authenticate. We generate useful metrics.
F) reauth.ms=0 (default),no lifetime for credential (e.g. PLAIN): Broker
sets session.lifetime=0, Broker doesn't terminate sessions, client doesn't
re-authenticate
G) reauth.ms=10,no lifetime for credential (e.g. PLAIN) : Broker sets
session.lifetime=10. Client re-authenticates.

I would have thought that D) is the typical scenario for OAuth/Kerberos to
respect token expiry time. G) would be typical scenario for PLAIN to force
re-authenication at regular intervals. A/B/C are useful to force
re-authentication in scenarios where you might check for credential
revocation in the server. And E/F are useful to disable re-authentication
and generate metrics (also the default behaviour useful during migration).
Have I missed something?


On Mon, Sep 17, 2018 at 4:27 PM, Ron Dagostino <[email protected]> wrote:

> Hi yet again, Rajini.  I also just realized that if the client is to learn
> the credential lifetime we wouldn't want to put special-case code in the
> Authenticator for GSSAPI and OAUTHBEARER; we would want to expose the value
> generically, probably as a negotiated property on the SaslClient instance.
> We might be talking about making the negotiated property key part of the
> public API.  In other words, the SaslClient would be responsible for
> exposing the credential (i.e. token or ticket) lifetime at a well-known
> negotiated property name, such as "Credential.Lifetime" and putting a Long
> value there if there is a lifetime.  That well-klnown key (e.g.
> "Credential.Lifetime") would be part of the public API, right?
>
> Ron
>
> On Mon, Sep 17, 2018 at 11:03 AM Ron Dagostino <[email protected]> wrote:
>
> > Hi again, Rajini.  After thinking about this a little while, it occurs to
> > me that maybe the communication of max session lifetime should occur via
> > SASL_HANDSHAKE after all.  Here's why.  The value communicated is the max
> > session lifetime allowed, and the client can assume it is the the session
> > lifetime for that particular session unless the particular SASL mechanism
> > could result in a shorter session that would be obvious to the client and
> > the server.  In particular, for OAUTHBEARER, the session lifetime will be
> > the token lifetime, which the client and server will both know.  Is
> there a
> > reason to communicate this value back to the client when the client
> already
> > knows it?  It's an extra network round-trip, and since the SASL
> round-trips
> > are defined by the spec I'm not certain adding an extra round-trip is
> > acceptable.  Even if we decide we can add it, it helps with latency if we
> > don't.
> >
> > Kerberos may be a bit different -- I don't know if the broker can learn
> > the session lifetime.  If it can then the same thing holds -- both client
> > and server will know the session lifetime and there is no reason to
> > communicate it back.  If the server can't learn the lifetime then I don't
> > think adding an extra SASL_AUTHENTICATE round trip is going to help,
> anyway.
> >
> > Also, by communicating the max session lifetime in the SASL_HANDSHAKE
> > response, both OAUTHBEARER and GSSAPI clients will be able to know before
> > sending any SASL_AUTHENTICATE requests whether their credential violates
> > the maximum.  This allows a behaving client to give a good error message.
> > A malicious client would ignore the value and send a longer-lived
> > credential, and then that would be rejected on the server side.
> >
> > I'm still good with ExpiringCredential not needing to be public.
> >
> > What do you think?
> >
> > Ron
> >
> > Ron
> >
> > Ron
> >
> > On Mon, Sep 17, 2018 at 9:13 AM Ron Dagostino <[email protected]> wrote:
> >
> >> Hi Rajini.  I've updated the KIP to reflect the decision to fully
> support
> >> this for all SASL mechanisms and to not require the ExpiringCredential
> >> interface to be public.
> >>
> >> Ron
> >>
> >> On Mon, Sep 17, 2018 at 6:55 AM Ron Dagostino <[email protected]>
> wrote:
> >>
> >>> Actually, I think the metric remains the same assuming the
> >>> authentication fails when the token  lifetime is too long.  Negative
> max
> >>> config on server counts what would have been killed because maybe
> client
> >>> re-auth is not turned on; positive max enables the kill, which is
> counted
> >>> by a second metric as proposed.  The current proposal had already
> stated
> >>> that a non-zero value would disallow an authentication with a token
> that
> >>> has too large a lifetime, and that is still the case, I think.  But
> let’s
> >>> make sure we are on the same page about all this.
> >>>
> >>> Ron
> >>>
> >>> > On Sep 17, 2018, at 6:42 AM, Ron Dagostino <[email protected]>
> wrote:
> >>> >
> >>> > Hi Rajini.  I see what you are saying.  The background login refresh
> >>> thread does have to factor into the decision for OAUTHBEARER because
> there
> >>> is no new token to re-authenticate with until that refresh succeeds
> >>> (similarly with Kerberos), but I think you are right that the interface
> >>> doesn’t necessarily have to be public.  The server does decide the
> time for
> >>> PLAIN and the SCRAM-related mechanisms under the current proposal, but
> it
> >>> is done in SaslHandshakeResponse, and at that point the server won’t
> have
> >>> any token yet; making it happen via SaslAuthenticateRequest at the
> very end
> >>> does allow the server to know everything for all mechanisms.
> >>> >
> >>> > I see two potential issues to discuss.  First, the server must
> >>> communicate a time that exceeds the (token or ticket) refresh period
> on the
> >>> client.  Assuming it communicates the expiration time of the
> token/ticket,
> >>> that’s the best it can do.  So I think that’ll be fine.
> >>> >
> >>> > The second issue is what happens if the server is configured to
> accept
> >>> a max value — say, an hour — and the token is good for longer.  I
> assumed
> >>> that the client should not be allowed to authenticate in the first
> place
> >>> because it could then simply re-authenticate with the same token after
> an
> >>> hour, which defeats the motivations for this KIP.  So do we agree the
> max
> >>> token lifetime allowed will be enforced at authentication time?
> Assuming
> >>> so, then we need to discuss migration.
> >>> >
> >>> > The current proposal includes a metric that can be used to identify
> if
> >>> an OAUTHBEARER client is misconfigured — the number of connections that
> >>> would have been killed (could be non-zero when the configured max is a
> >>> negative number). Do we still want this type of an indication, and if
> so,
> >>> is it still done the same way — a negative number for max, but instead
> of
> >>> counting the number of kills that would have been done it counts the
> number
> >>> of authentications that would have been failed?
> >>> >
> >>> > Ron
> >>> >
> >>> >> On Sep 17, 2018, at 6:06 AM, Rajini Sivaram <
> [email protected]>
> >>> wrote:
> >>> >>
> >>> >> Hi Ron,
> >>> >>
> >>> >> Thinking a bit more about other SASL mechanisms, I think the issue
> is
> >>> that
> >>> >> in the current proposal, clients decide the re-authentication
> period.
> >>> For
> >>> >> mechanisms like PLAIN or SCRAM, we would actually want server to
> >>> determine
> >>> >> the re-authentication interval. For example, the PLAIN or SCRAM
> >>> database
> >>> >> could specify the expiry time for each principal, or the broker
> could
> >>> be
> >>> >> configured with a single expiry time for all principals of that
> >>> mechanism.
> >>> >> For OAuth, brokers do know the expiry time. Haven't figured out what
> >>> to do
> >>> >> with Kerberos, but in any case for broker-side connection
> termination
> >>> on
> >>> >> expiry, we need the broker to know/decide the expiry time. So I
> would
> >>> like
> >>> >> to suggest a slightly different approach to managing
> >>> re-authentications.
> >>> >>
> >>> >> 1) Instead of changing SASL_HANDSHAKE version number, we bump up
> >>> >> SASL_AUTHENTICATE version number.
> >>> >> 2) In the final SASL_AUTHENTICATE response, broker returns the
> expiry
> >>> time
> >>> >> of this authenticated session. This is the interval after which
> >>> broker will
> >>> >> terminate the connection If it wasn't re-authenticated.
> >>> >> 3) We no longer need the public interface change to add `
> >>> >> org.apache.kafka.common.security.expiring.ExpiringCredential` for
> the
> >>> >> client-side. Instead we schedule the next re-authentication on the
> >>> client
> >>> >> based on the expiry time provided by the server (some time earlier
> >>> than the
> >>> >> expiry).
> >>> >> 4) If client uses SASL_AUTHENTICATE v0, broker will not return
> expiry
> >>> time.
> >>> >> The connection will be terminated if that feature is enabled (the
> >>> same code
> >>> >> path as client failing to re-authenticte on time).
> >>> >>
> >>> >> Thoughts?
> >>> >>
> >>> >>> On Sat, Sep 15, 2018 at 3:06 AM, Ron Dagostino <[email protected]>
> >>> wrote:
> >>> >>>
> >>> >>> Hi everyone.  I've updated the KIP to reflect all discussion to
> date,
> >>> >>> including the decision to go with the low-level approach.  This
> >>> latest
> >>> >>> version of the KIP includes the above "connections.max.reauth.ms"
> >>> >>> proposal,
> >>> >>> which I know has not been discussed yet.  It mentions new metrics,
> >>> some of
> >>> >>> which may not have been discussed either (and names are missing
> from
> >>> some
> >>> >>> of them).  Regardless, this new version is the closest yet to a
> >>> version
> >>> >>> that can be put to a vote next week.
> >>> >>>
> >>> >>> Ron
> >>> >>>
> >>> >>>> On Fri, Sep 14, 2018 at 8:59 PM Ron Dagostino <[email protected]>
> >>> wrote:
> >>> >>>>
> >>> >>>> Minor correction: I'm proposing "connections.max.reauth.ms" as
> the
> >>> >>>> broker-side configuration property, not "
> >>> >>>> connections.max.expired.credentials.ms".
> >>> >>>>
> >>> >>>> Ron
> >>> >>>>
> >>> >>>>> On Fri, Sep 14, 2018 at 8:40 PM Ron Dagostino <[email protected]
> >
> >>> wrote:
> >>> >>>>>
> >>> >>>>> Hi Rajini.  I'm going to choose *
> >>> connections.max.expired.credentials.ms
> >>> >>>>> <http://connections.max.expired.credentials.ms>* as the option
> >>> name
> >>> >>>>> because it is consistent with the comment I made in the section
> >>> about
> >>> >>> how
> >>> >>>>> to get the client and server to agree on credential lifetime:
> >>> >>>>>
> >>> >>>>> "We could add a new API call so that clients could ask servers
> for
> >>> the
> >>> >>>>> lifetime they use, or we could extend the
> >>> SaslHandshakeRequest/Response
> >>> >>> API
> >>> >>>>> call to include that information in the server's response – the
> >>> client
> >>> >>>>> would then adopt that value"
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> We set the config option value on the broker (with the "
> >>> >>>>> listener.name.mechanism." prefix), and we will return the
> >>> configured
> >>> >>>>> value in the SaslHandshakeResponse (requiring a wire format
> change
> >>> in
> >>> >>>>> addition to a version bump).  The value (referred to as "X"
> below)
> >>> can
> >>> >>> be
> >>> >>>>> negative, zero, or positive and is to be interpreted as follows:
> >>> >>>>>
> >>> >>>>> X = 0: Fully disable.  The server will respond to
> >>> re-authentications,
> >>> >>> but
> >>> >>>>> it won't kill any connections due to expiration, and it won't
> track
> >>> >>> either
> >>> >>>>> of the 2 metrics mentioned below.
> >>> >>>>>
> >>> >>>>> Now, a couple of definitions for when X != 0:
> >>> >>>>>
> >>> >>>>> 1) We define the *maximum allowed expiration time* to be the time
> >>> >>>>> determined by the point when (re-)authentication occurs plus |X|
> >>> >>>>> milliseconds.
> >>> >>>>> 2) We define the *requested expiration time* to be the maximum
> >>> allowed
> >>> >>>>> expiration time except for the case of an OAuth bearer token, in
> >>> which
> >>> >>> case
> >>> >>>>> it is the point at which the token expires.  (Kerberos presumably
> >>> also
> >>> >>>>> specifies a ticket lifetime, but frankly I am not in a position
> to
> >>> do
> >>> >>> any
> >>> >>>>> Kerberos-related coding in time for a 2.1.0 release and would
> >>> prefer to
> >>> >>>>> ignore this piece of information for this KIP -- would it be
> >>> acceptable
> >>> >>> to
> >>> >>>>> have someone else add it later?).
> >>> >>>>>
> >>> >>>>> Based on these definitions, we define the behavior as follows:
> >>> >>>>>
> >>> >>>>> *X > 0: Fully enable*
> >>> >>>>> A) The server will reject any authentication/re-authentication
> >>> attempt
> >>> >>>>> when the requested expiration time is after the maximum allowed
> >>> >>> expiration
> >>> >>>>> time (which could only happen with an OAuth bearer token,
> assuming
> >>> we
> >>> >>> skip
> >>> >>>>> dealing with Kerberos for now).
> >>> >>>>> B) The server will kill connections that are used beyond the
> >>> requested
> >>> >>>>> expiration time.
> >>> >>>>> C) A broker metric will be maintained that documents the number
> >>> >>>>> connections killed by the broker.  This metric will be non-zero
> if
> >>> a
> >>> >>> client
> >>> >>>>> is connecting to the broker with re-authentication either
> >>> unavailable
> >>> >>> (i.e.
> >>> >>>>> an older client) or disabled.
> >>> >>>>>
> >>> >>>>> *X < 0: Partially enable*
> >>> >>>>> A) Same as above: the server will reject any
> >>> >>>>> authentication/re-authentication attempt when the requested
> >>> expiration
> >>> >>> time
> >>> >>>>> is after the maximum allowed expiration time (which could only
> >>> happen
> >>> >>> with
> >>> >>>>> an OAuth bearer token, assuming we skip dealing with Kerberos for
> >>> now).
> >>> >>>>> B) The server will **not** kill connections that are used beyond
> >>> the
> >>> >>>>> requested expiration time.
> >>> >>>>> C) A broker metric will be maintained that documents the number
> of
> >>> API
> >>> >>>>> requests unrelated to re-authentication that are made over a
> >>> connection
> >>> >>>>> beyond the requested expiration time.  This metric will be
> helpful
> >>> for
> >>> >>>>> operators to ensure that all clients are properly upgraded and
> >>> >>>>> re-authenticating before fully enabling the server-side
> >>> >>>>> expired-connection-kill functionality (i.e. by changing the value
> >>> from a
> >>> >>>>> negative number to a positive number): this metric will be zero
> >>> across
> >>> >>> all
> >>> >>>>> brokers when it is safe to fully enable the server-side feature.
> >>> >>>>>
> >>> >>>>> Thoughts?
> >>> >>>>>
> >>> >>>>> Ron
> >>> >>>>>
> >>> >>>>> On Fri, Sep 14, 2018 at 11:59 AM Rajini Sivaram <
> >>> >>> [email protected]>
> >>> >>>>> wrote:
> >>> >>>>>
> >>> >>>>>> It will be good to shorten the config name. We have a config
> >>> named `
> >>> >>>>>> connections.max.idle.ms`. We could add something like `
> >>> >>>>>> connections.max.expired.credentials.ms`. As you said, it would
> be
> >>> >>>>>> prefixed
> >>> >>>>>> with listener and SASL mechanism name. We should be able to
> >>> support
> >>> >>>>>> connection termination in future even with SSL, so perhaps we
> >>> don't
> >>> >>> want
> >>> >>>>>> the `sasl.server.` prefix (we just need to validate whether the
> >>> config
> >>> >>> is
> >>> >>>>>> supported for the protocol). You can choose whether a boolean
> >>> flag or a
> >>> >>>>>> time interval makes more sense.
> >>> >>>>>>
> >>> >>>>>> For the client-side, the KIP explains how to make it work for
> >>> other
> >>> >>>>>> mechanisms and we can leave it that. Not convinced about
> >>> server-side
> >>> >>>>>> though. It seems to me that we probably would require API
> changes
> >>> to
> >>> >>> make
> >>> >>>>>> it work. Basically the proposed approach works only for
> >>> OAUTHBEARER.
> >>> >>>>>> Since
> >>> >>>>>> we have to bump up SaslHandshakeRequest version in this KIP, it
> >>> will be
> >>> >>>>>> good to work out if we need to change the request or flow to
> make
> >>> this
> >>> >>>>>> work
> >>> >>>>>> for other mechanisms. I haven't figured out exactly what is
> >>> needed, but
> >>> >>>>>> will think about it and get back next week. In the meantime, you
> >>> can
> >>> >>> get
> >>> >>>>>> the KIP up-to-date with the config, migration path etc. and get
> it
> >>> >>> ready
> >>> >>>>>> to
> >>> >>>>>> start vote next week.
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>>
> >>> >>>>>> On Fri, Sep 14, 2018 at 1:24 PM, Ron Dagostino <
> [email protected]
> >>> >
> >>> >>>>>> wrote:
> >>> >>>>>>
> >>> >>>>>>> Hi Rajini.
> >>> >>>>>>>
> >>> >>>>>>> Agreed, we will bump the SaslHandshakeRequest API number (no
> wire
> >>> >>>>>> format
> >>> >>>>>>> change, of course).
> >>> >>>>>>>
> >>> >>>>>>> Agreed, responding to re-authentication will always be enabled
> >>> on the
> >>> >>>>>>> broker since it is initiated by the client.
> >>> >>>>>>>
> >>> >>>>>>> Agreed, connection termination by the broker will be opt-in.
> I'm
> >>> >>>>>> thinking
> >>> >>>>>>> *sasl.server.disconnect.expired.credential.connections.
> enable*,
> >>> >>>>>> prefixed
> >>> >>>>>>> with listener and SASL mechanism name in lower-case so it can
> be
> >>> >>>>>> opt-in at
> >>> >>>>>>> the granularity of a SASL mechanism; for example, "
> >>> >>>>>>> listener.name.sasl_ssl.oauthbearer.sasl.server.
> >>> >>>>>>> disconnect.expired.credential.connections.enable=true
> >>> >>>>>>> ".  It is long, so if you have a preference for a shorter name
> >>> let me
> >>> >>>>>> know
> >>> >>>>>>> and we'll go with it instead.
> >>> >>>>>>>
> >>> >>>>>>> Agreed, additional metrics will be helpful.  I'll add them to
> the
> >>> >>> KIP.
> >>> >>>>>>>
> >>> >>>>>>> <<<need to work out exactly how this works with all SASL
> >>> mechanisms
> >>> >>>>>>> <<<not sure we have the data required on the broker or a way of
> >>> >>>>>>> <<< extending the [GSSAPI] mechanism
> >>> >>>>>>> Not sure I agree here.  The paragraph I added to the KIP
> >>> describes
> >>> >>> how
> >>> >>>>>> I
> >>> >>>>>>> think this can be done.  Given that description, do you still
> >>> feel
> >>> >>>>>> Kerberos
> >>> >>>>>>> will not be possible?  If Kerberos presents a significant
> hurdle
> >>> >>> then I
> >>> >>>>>>> don't think that should prevent us from doing it for other
> >>> mechanisms
> >>> >>>>>> -- I
> >>> >>>>>>> would rather state that it isn't supported with GSSAPI due to
> >>> >>>>>> limitations
> >>> >>>>>>> in Kerberos itself than not have it for OAUTHBEARER, for
> example.
> >>> >>> And
> >>> >>>>>>> right now I don't think we need more than a high-level
> >>> description of
> >>> >>>>>> how
> >>> >>>>>>> this could be done.  In other words, I think we have this
> >>> covered,
> >>> >>>>>> unless
> >>> >>>>>>> there is a fundamental problem due to Kerberos not making data
> >>> (i.e.
> >>> >>>>>> ticket
> >>> >>>>>>> expiration) available on the server.   I want to submit this
> KIP
> >>> for
> >>> >>> a
> >>> >>>>>> vote
> >>> >>>>>>> early next week in the hope of having it accepted by the
> Monday,
> >>> 9/24
> >>> >>>>>>> deadline for the 2.1.0 release, and if that happens I believe I
> >>> can
> >>> >>>>>> get a
> >>> >>>>>>> PR into really good shape soon thereafter (I'm working on it
> >>> now).
> >>> >>>>>>>
> >>> >>>>>>> Ron
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>> On Fri, Sep 14, 2018 at 5:57 AM Rajini Sivaram <
> >>> >>>>>> [email protected]>
> >>> >>>>>>> wrote:
> >>> >>>>>>>
> >>> >>>>>>>> Hi Ron,
> >>> >>>>>>>>
> >>> >>>>>>>> 1) Yes, we should update the version of
> `SaslHandshakeRequest`.
> >>> >>>>>> Clients
> >>> >>>>>>>> would attempt to re-authenticate only if broker's
> >>> >>>>>> SaslHandshakeRequest
> >>> >>>>>>>> version >=2.
> >>> >>>>>>>> 2) I think we should enable broker's re-authentication and
> >>> >>> connection
> >>> >>>>>>>> termination code regardless of client version.
> Re-authentication
> >>> >>>>>> could be
> >>> >>>>>>>> always enabled since it is triggered only by clients and
> broker
> >>> >>> could
> >>> >>>>>>> just
> >>> >>>>>>>> handle it. Connection termination should be opt-in, but should
> >>> work
> >>> >>>>>> with
> >>> >>>>>>>> clients of any version. If turned on and clients are of an
> older
> >>> >>>>>> version,
> >>> >>>>>>>> this would simply force reconnection, which should be ok.
> >>> >>> Additional
> >>> >>>>>>>> metrics to monitor expired connections may be useful anyway.
> >>> >>>>>>>>
> >>> >>>>>>>> We also need to work out exactly how this works with all SASL
> >>> >>>>>> mechanisms.
> >>> >>>>>>>> In particular, we have ensure that we can make it work with
> >>> >>> Kerberos
> >>> >>>>>>> since
> >>> >>>>>>>> we use the implementation from the JRE and I am not sure we
> have
> >>> >>> the
> >>> >>>>>> data
> >>> >>>>>>>> required on the broker or a way of extending the mechanism.
> >>> >>>>>>>>
> >>> >>>>>>>> Thoughts?
> >>> >>>>>>>>
> >>> >>>>>>>> On Thu, Sep 13, 2018 at 6:56 PM, Ron Dagostino <
> >>> [email protected]>
> >>> >>>>>>> wrote:
> >>> >>>>>>>>
> >>> >>>>>>>>> Hi Rajini.  I'm thinking about how we deal with migration.
> >>>  For
> >>> >>>>>>> example,
> >>> >>>>>>>>> let's say we have an existing 2.0.0 cluster using
> >>> >>> SASL/OAUTHBEARER
> >>> >>>>>> and
> >>> >>>>>>> we
> >>> >>>>>>>>> want to use this feature.  The desired end state is to have
> all
> >>> >>>>>> clients
> >>> >>>>>>>> and
> >>> >>>>>>>>> brokers migrated to a version that supports the feature (call
> >>> it
> >>> >>>>>> 2.x)
> >>> >>>>>>>> with
> >>> >>>>>>>>> the feature turned on.  We need to document the supported
> >>> path(s)
> >>> >>>>>> to
> >>> >>>>>>> this
> >>> >>>>>>>>> end state.
> >>> >>>>>>>>>
> >>> >>>>>>>>> Here are a couple of scenarios with implications:
> >>> >>>>>>>>>
> >>> >>>>>>>>> 1) *Migrate client to 2.x and turn the feature on but broker
> is
> >>> >>>>>> still
> >>> >>>>>>> at
> >>> >>>>>>>>> 2.0.*  In this case the client is going to get an error when
> it
> >>> >>>>>> sends
> >>> >>>>>>> the
> >>> >>>>>>>>> SaslHandshakeRequest.  We want to avoid this.  It seems to me
> >>> the
> >>> >>>>>>> client
> >>> >>>>>>>>> needs to know if the broker has been upgraded to the
> necessary
> >>> >>>>>> version
> >>> >>>>>>>>> before trying to re-authenticate, which makes me believe we
> >>> need
> >>> >>> to
> >>> >>>>>>> bump
> >>> >>>>>>>>> the API version number in 2.x and the client has to check to
> >>> see
> >>> >>>>>> if the
> >>> >>>>>>>> max
> >>> >>>>>>>>> version the broker supports meets a minimum standard before
> >>> >>> trying
> >>> >>>>>> to
> >>> >>>>>>>>> re-authenticate.  Do you agree?
> >>> >>>>>>>>>
> >>> >>>>>>>>> 2) *Migrate broker to 2.x and turn the feature on but client
> is
> >>> >>>>>> still
> >>> >>>>>>> at
> >>> >>>>>>>>> 2.0*.  In this case the broker is going to end up killing
> >>> >>>>>> connections
> >>> >>>>>>>>> periodically.   Again, we want to avoid this.  It's tempting
> to
> >>> >>> say
> >>> >>>>>>>> "don't
> >>> >>>>>>>>> do this" but I wonder if we can require installations to
> >>> upgrade
> >>> >>>>>> all
> >>> >>>>>>>>> clients before turning the feature on at the brokers.
> >>> Certainly
> >>> >>>>>> we can
> >>> >>>>>>>> say
> >>> >>>>>>>>> "don't do this" for inter-broker clients -- in other words,
> we
> >>> >>> can
> >>> >>>>>> say
> >>> >>>>>>>> that
> >>> >>>>>>>>> all brokers in a cluster should be upgraded before the
> feature
> >>> is
> >>> >>>>>>> turned
> >>> >>>>>>>> on
> >>> >>>>>>>>> for any one of them -- but I don't know about our ability to
> >>> say
> >>> >>>>>> it for
> >>> >>>>>>>>> non-broker clients.
> >>> >>>>>>>>>
> >>> >>>>>>>>> Now we consider the cases where both the brokers and the
> >>> clients
> >>> >>>>>> have
> >>> >>>>>>>> been
> >>> >>>>>>>>> upgraded to 2.x.  When and where should the feature be turned
> >>> on?
> >>> >>>>>> The
> >>> >>>>>>>>> inter-broker case is interesting because I don't think think
> we
> >>> >>> can
> >>> >>>>>>>> require
> >>> >>>>>>>>> installations to restart every broker with a new config where
> >>> the
> >>> >>>>>>> feature
> >>> >>>>>>>>> is turned on at the same time.  Do you agree that we cannot
> >>> >>> require
> >>> >>>>>>> this
> >>> >>>>>>>>> and therefore must support installations restarting brokers
> >>> with
> >>> >>> a
> >>> >>>>>> new
> >>> >>>>>>>>> config at whatever pace they need -- which may be quite slow
> >>> >>>>>> relative
> >>> >>>>>>> to
> >>> >>>>>>>>> token lifetimes?  Assuming you do agree, then there is going
> to
> >>> >>> be
> >>> >>>>>> the
> >>> >>>>>>>> case
> >>> >>>>>>>>> where some brokers are going to have the feature turned on
> and
> >>> >>> some
> >>> >>>>>>>> clients
> >>> >>>>>>>>> (definitely inter-broker clients at a minimum) are going to
> >>> have
> >>> >>>>>> the
> >>> >>>>>>>>> feature turned off.
> >>> >>>>>>>>>
> >>> >>>>>>>>> The above discussion assumes a single config on the broker
> side
> >>> >>>>>> that
> >>> >>>>>>>> turns
> >>> >>>>>>>>> on both the inter-broker client re-authentication feature as
> >>> well
> >>> >>>>>> as
> >>> >>>>>>> the
> >>> >>>>>>>>> server-side expired-connection-kill feature, but now I'm
> >>> thinking
> >>> >>>>>> we
> >>> >>>>>>> need
> >>> >>>>>>>>> the ability to turn these features on independently, plus
> maybe
> >>> >>> we
> >>> >>>>>>> need a
> >>> >>>>>>>>> way to monitor the number of "active, expired" connections to
> >>> the
> >>> >>>>>>> server
> >>> >>>>>>>> so
> >>> >>>>>>>>> that operators can be sure that all clients have been
> >>> >>>>>> upgraded/enabled
> >>> >>>>>>>>> before turning on the server-side expired-connection-kill
> >>> >>> feature.
> >>> >>>>>>>>>
> >>> >>>>>>>>> So the migration plan would be as follows:
> >>> >>>>>>>>>
> >>> >>>>>>>>> 1) Upgrade all brokers to 2.x.
> >>> >>>>>>>>> 2) After all brokers are upgraded, turn on re-authentication
> >>> for
> >>> >>>>>> all
> >>> >>>>>>>>> brokers at whatever rate is desired -- just eventually, at
> some
> >>> >>>>>> point,
> >>> >>>>>>>> get
> >>> >>>>>>>>> the client-side feature turned on for all brokers so that
> >>> >>>>>> inter-broker
> >>> >>>>>>>>> connections are re-authenticating.
> >>> >>>>>>>>> 3) In parallel with (1) and (2) above, upgrade clients to 2.x
> >>> and
> >>> >>>>>> turn
> >>> >>>>>>>>> their re-authentication feature on.  Clients will check the
> API
> >>> >>>>>> version
> >>> >>>>>>>> and
> >>> >>>>>>>>> only re-authenticate to a broker that has also been upgraded
> >>> >>> (note
> >>> >>>>>> that
> >>> >>>>>>>> the
> >>> >>>>>>>>> ability of a broker to respond to a re-authentication cannot
> be
> >>> >>>>>> turned
> >>> >>>>>>>> off
> >>> >>>>>>>>> -- it is always on beginning with version 2.x, and it just
> sits
> >>> >>>>>> there
> >>> >>>>>>>> doing
> >>> >>>>>>>>> nothing if it isn't exercised by an enabled client)
> >>> >>>>>>>>> 4) After (1), (2), and (3) are complete, check the 2.x broker
> >>> >>>>>> metrics
> >>> >>>>>>> to
> >>> >>>>>>>>> confirm that there are no "active, expired" connections.
> Once
> >>> >>> you
> >>> >>>>>> are
> >>> >>>>>>>>> satisfied that (1), (2), and (3) are indeed complete you can
> >>> >>>>>> enable the
> >>> >>>>>>>>> server-side expired-connection-kill feature on each broker
> via
> >>> a
> >>> >>>>>>> restart
> >>> >>>>>>>> at
> >>> >>>>>>>>> whatever pace is desired.
> >>> >>>>>>>>>
> >>> >>>>>>>>> Comments?
> >>> >>>>>>>>>
> >>> >>>>>>>>> Ron
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>> On Wed, Sep 12, 2018 at 4:48 PM Ron Dagostino <
> >>> [email protected]
> >>> >>>>
> >>> >>>>>>> wrote:
> >>> >>>>>>>>>
> >>> >>>>>>>>>> Ok, I am tempted to just say we go with the low-level
> approach
> >>> >>>>>> since
> >>> >>>>>>> it
> >>> >>>>>>>>> is
> >>> >>>>>>>>>> the quickest and seems to meet the clear requirements. We
> can
> >>> >>>>>> always
> >>> >>>>>>>>> adjust
> >>> >>>>>>>>>> later if we get to clarity on other requirements or we
> decide
> >>> >>> we
> >>> >>>>>> need
> >>> >>>>>>>> to
> >>> >>>>>>>>>> approach it differently for whatever reason.  But in the
> >>> >>>>>> meantime,
> >>> >>>>>>>> before
> >>> >>>>>>>>>> fully committing to this decision, I would appreciate
> another
> >>> >>>>>>>> perspective
> >>> >>>>>>>>>> if someone has one.
> >>> >>>>>>>>>>
> >>> >>>>>>>>>> Ron
> >>> >>>>>>>>>>
> >>> >>>>>>>>>> On Wed, Sep 12, 2018 at 3:15 PM Rajini Sivaram <
> >>> >>>>>>>> [email protected]>
> >>> >>>>>>>>>> wrote:
> >>> >>>>>>>>>>
> >>> >>>>>>>>>>> Hi Ron,
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>> Yes, I would leave out retries from this KIP for now. In
> the
> >>> >>>>>> future,
> >>> >>>>>>>> if
> >>> >>>>>>>>>>> there is a requirement for supporting retries, we can
> >>> consider
> >>> >>>>>> it. I
> >>> >>>>>>>>> think
> >>> >>>>>>>>>>> we can support retries with either approach if we needed
> to,
> >>> >>>>>> but it
> >>> >>>>>>>>> would
> >>> >>>>>>>>>>> be better to do it along with other changes required to
> >>> >>> support
> >>> >>>>>>>>>>> authentication servers that are not highly available.
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>> For maintainability, I am biased, so it will be good to get
> >>> >>> the
> >>> >>>>>>>>>>> perspective
> >>> >>>>>>>>>>> of others in the community :-)
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>> On Wed, Sep 12, 2018 at 7:47 PM, Ron Dagostino <
> >>> >>>>>> [email protected]>
> >>> >>>>>>>>> wrote:
> >>> >>>>>>>>>>>
> >>> >>>>>>>>>>>> Hi Rajini.  Here is some feedback/some things I thought
> of.
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> First, I just realized that from a timing perspective
> that I
> >>> >>>>>> am
> >>> >>>>>>> not
> >>> >>>>>>>>> sure
> >>> >>>>>>>>>>>> retry is going to be necessary.  The background login
> >>> >>> refresh
> >>> >>>>>>> thread
> >>> >>>>>>>>>>>> triggers re-authentication when it refreshes the client's
> >>> >>>>>>>> credential.
> >>> >>>>>>>>>>> The
> >>> >>>>>>>>>>>> OAuth infrastructure has to be available in order for this
> >>> >>>>>> refresh
> >>> >>>>>>>> to
> >>> >>>>>>>>>>>> succeed (the background thread repeatedly retries if it
> >>> >>> can't
> >>> >>>>>>>> refresh
> >>> >>>>>>>>>>> the
> >>> >>>>>>>>>>>> credential, and that retry loop handles any temporary
> >>> >>>>>> outage).  If
> >>> >>>>>>>>>>> clients
> >>> >>>>>>>>>>>> are told to re-authenticate when the credential is
> refreshed
> >>> >>>>>> **and
> >>> >>>>>>>>> they
> >>> >>>>>>>>>>>> actually re-authenticate at that point** then it is highly
> >>> >>>>>>> unlikely
> >>> >>>>>>>>> that
> >>> >>>>>>>>>>>> the OAuth infrastructure would fail within those
> intervening
> >>> >>>>>>>>>>> milliseconds.
> >>> >>>>>>>>>>>> So we don't need re-authentication retry in this KIP as
> long
> >>> >>>>>> as
> >>> >>>>>>>> retry
> >>> >>>>>>>>>>>> starts immediately.  The low-level prototype as currently
> >>> >>>>>> coded
> >>> >>>>>>>>> doesn't
> >>> >>>>>>>>>>>> actually start re-authentication until the connection is
> >>> >>>>>>>> subsequently
> >>> >>>>>>>>>>> used,
> >>> >>>>>>>>>>>> and that could take a while.  But then again, if the
> >>> >>>>>> connection
> >>> >>>>>>>> isn't
> >>> >>>>>>>>>>> used
> >>> >>>>>>>>>>>> for some period of time, then the lost value of having to
> >>> >>>>>> abandon
> >>> >>>>>>>> the
> >>> >>>>>>>>>>>> connection is lessened anyway.  Plus, as you pointed out,
> >>> >>>>>> OAuth
> >>> >>>>>>>>>>>> infrastructure is assumed to be highly available.
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> Does this makes sense, and would you be willing to say
> that
> >>> >>>>>> retry
> >>> >>>>>>>>> isn't
> >>> >>>>>>>>>>> a
> >>> >>>>>>>>>>>> necessary requirement?  I'm tempted but would like to hear
> >>> >>>>>> your
> >>> >>>>>>>>>>> perspective
> >>> >>>>>>>>>>>> first.
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> Interleaving/latency and maintainability (more than lines
> of
> >>> >>>>>> code)
> >>> >>>>>>>> are
> >>> >>>>>>>>>>> the
> >>> >>>>>>>>>>>> two remaining areas of comparison.  I did not add the
> >>> >>>>>>>> maintainability
> >>> >>>>>>>>>>> issue
> >>> >>>>>>>>>>>> to the KIP yet, but before I do I thought I would address
> it
> >>> >>>>>> here
> >>> >>>>>>>>> first
> >>> >>>>>>>>>>> to
> >>> >>>>>>>>>>>> see if we can come to consensus on it because I'm not
> sure I
> >>> >>>>>> see
> >>> >>>>>>> the
> >>> >>>>>>>>>>>> high-level approach as being hard to maintain (yet -- I
> >>> >>> could
> >>> >>>>>> be
> >>> >>>>>>>>>>>> convinced/convince myself; we'll see).  I do want to make
> >>> >>>>>> sure we
> >>> >>>>>>>> are
> >>> >>>>>>>>> on
> >>> >>>>>>>>>>>> the same page about what is required to add
> >>> >>> re-authentication
> >>> >>>>>>>> support
> >>> >>>>>>>>> in
> >>> >>>>>>>>>>>> the high-level case.  Granted, the amount is zero in the
> >>> >>>>>> low-level
> >>> >>>>>>>>> case,
> >>> >>>>>>>>>>>> and it doesn't get any better that that, but the amount in
> >>> >>> the
> >>> >>>>>>>>>>> high-level
> >>> >>>>>>>>>>>> case is very low -- just a few lines of code.  For
> example:
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> KafkaAdminClient:
> >>> >>>>>>>>>>>> https://github.com/apache/kafka/pull/5582/commits/
> >>> >>> 4fa70f38b9
> >>> >>>>>>>>>>>> d33428ff98b64a3a2bd668f5f28c38#diff-
> >>> >>>>>>> 6869b8fccf6b098cbcb0676e8ceb26a7
> >>> >>>>>>>>>>>> It is the same few lines of code for KafkaConsumer,
> >>> >>>>>> KafkaProducer,
> >>> >>>>>>>>>>>> WorkerGroupMember, and TransactionMarkerChannelManager
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> The two synchronous I/O use cases are
> >>> >>>>>> ControllerChannelManager and
> >>> >>>>>>>>>>>> ReplicaFetcherBlockingSend (via ReplicaFetcherThread), and
> >>> >>>>>> they
> >>> >>>>>>>>> require
> >>> >>>>>>>>>>> a
> >>> >>>>>>>>>>>> little bit more -- but not much.
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> Thoughts?
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> Ron
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>> On Wed, Sep 12, 2018 at 1:57 PM Ron Dagostino <
> >>> >>>>>> [email protected]>
> >>> >>>>>>>>>>> wrote:
> >>> >>>>>>>>>>>>
> >>> >>>>>>>>>>>>> Thanks, Rajini.  Before I digest/respond to that, here's
> >>> >>> an
> >>> >>>>>>> update
> >>> >>>>>>>>>>> that I
> >>> >>>>>>>>>>>>> just completed.
> >>> >>>>>>>>>>>>>
> >>> >>>>>>>>>>>>> I added a commit to the PR (
> >>> >>>>>>>>>>> https://github.com/apache/kafka/pull/5582/)
> >>> >>>>>>>>>>>>> that implements server-side kill of expired OAUTHBEARER
> >>> >>>>>>>> connections.
> >>> >>>>>>>>>>> No
> >>> >>>>>>>>>>>>> tests yet since we still haven't settled on a final
> >>> >>> approach
> >>> >>>>>>>>>>> (low-level
> >>> >>>>>>>>>>>> vs.
> >>> >>>>>>>>>>>>> high-level).
> >>> >>>>>>>>>>>>>
> >>> >>>>>>>>>>>>> I also updated the KIP to reflect the latest discussion
> >>> >>> and
> >>> >>>>>> PR
> >>> >>>>>>> as
> >>> >>>>>>>>>>>> follows:
> >>> >>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>  - Include support for brokers killing connections as
> >>> >>>>>> part of
> >>> >>>>>>>> this
> >>> >>>>>>>>>>> KIP
> >>> >>>>>>>>>>>>>  (rather than deferring it to a future KIP as was
> >>> >>>>>> originally
> >>> >>>>>>>>>>>> mentioned; the
> >>> >>>>>>>>>>>>>  PR now includes it as mentioned -- it was very easy to
> >>> >>>>>> add)
> >>> >>>>>>>>>>>>>  - Added metrics (they will mirror existing ones related
> >>> >>>>>> to
> >>> >>>>>>>>>>>>>  authentications; I have not added those to the PR)
> >>> >>>>>>>>>>>>>  - Updated the implementation description to reflect the
> >>> >>>>>>> current
> >>> >>>>>>>>>>> state
> >>> >>>>>>>>>>>>>  of the PR, which is a high-level, one-size-fits-all
> >>> >>>>>> approach
> >>> >>>>>>>> (as
> >>> >>>>>>>>>>>> opposed to
> >>> >>>>>>>>>>>>>  my initial, even-higher-level approach)
> >>> >>>>>>>>>>>>>  - Added a "Rejected Alternative" for the first version
> >>> >>>>>> of the
> >>> >>>>>>>> PR,
> >>> >>>>>>>>>>>>>  which injected requests directly into synchronous I/O
> >>> >>>>>>> clients'
> >>> >>>>>>>>>>> queues
> >>> >>>>>>>>>>>>>  - Added a "Rejected Alternative" for the low-level
> >>> >>>>>> approach
> >>> >>>>>>> as
> >>> >>>>>>>>>>>>>  suggested, but of course we have not formally decided
> >>> >>> to
> >>> >>>>>>> reject
> >>> >>>>>>>>>>> this
> >>> >>>>>>>>>>>>>  approach or adopt the current PR implementation.
> >>> >>>>>>>>>>>>>
> >>> >>>>>>>>>>>>> I'll think about where we stand some more before
> >>> >>> responding
> >>> >>>>>>> again.
> >>> >>>>>>>>>>>> Thanks
> >>> >>>>>>>>>>>>> for the above reply.
> >>> >>>>>>>>>>>>>
> >>> >>>>>>>>>>>>> Ron
> >>> >>>>>>>>>>>>>
> >>> >>>>>>>>>>>>> On Wed, Sep 12, 2018 at 1:36 PM Rajini Sivaram <
> >>> >>>>>>>>>>> [email protected]>
> >>> >>>>>>>>>>>>> wrote:
> >>> >>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>> Hi Ron,
> >>> >>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>> Thank you for summarising, I think it covers the
> >>> >>>>>> differences
> >>> >>>>>>>>> between
> >>> >>>>>>>>>>> the
> >>> >>>>>>>>>>>>>> two approaches well.
> >>> >>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>> A few minor points to answer the questions in there:
> >>> >>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>> 1) When re-authetication is initiated in the Selector
> >>> >>>>>> during
> >>> >>>>>>>>> poll(),
> >>> >>>>>>>>>>> we
> >>> >>>>>>>>>>>>>> can
> >>> >>>>>>>>>>>>>> move an idle channel to re-authentication state. It is
> >>> >>>>>> similar
> >>> >>>>>>> to
> >>> >>>>>>>>>>>>>> injecting
> >>> >>>>>>>>>>>>>> requests, but achieved by changing channel back to
> >>> >>>>>>> authenticating
> >>> >>>>>>>>>>> state.
> >>> >>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>> 3) To clarify why I think re-authentication should fit
> in
> >>> >>>>>> with
> >>> >>>>>>>> our
> >>> >>>>>>>>>>>>>> authentication design: My point was not about a specific
> >>> >>>>>>>> connection
> >>> >>>>>>>>>>>> being
> >>> >>>>>>>>>>>>>> usable or not usable. It was about what happens at the
> >>> >>>>>> client
> >>> >>>>>>> API
> >>> >>>>>>>>>>> level.
> >>> >>>>>>>>>>>>>> Our client API (producer/consumer/admin client etc.)
> >>> >>>>>> currently
> >>> >>>>>>>>> assume
> >>> >>>>>>>>>>>> that
> >>> >>>>>>>>>>>>>> a single broker authentication failure is a fatal error
> >>> >>>>>> that is
> >>> >>>>>>>>> never
> >>> >>>>>>>>>>>>>> retried because we assume that broker only ever fails an
> >>> >>>>>>>>>>> authentication
> >>> >>>>>>>>>>>>>> request if credentials are invalid. If we ever decide to
> >>> >>>>>>> support
> >>> >>>>>>>>>>> cases
> >>> >>>>>>>>>>>>>> where broker occasionally fails an authentication
> request
> >>> >>>>>> due
> >>> >>>>>>> to
> >>> >>>>>>>> a
> >>> >>>>>>>>>>>>>> transient failure, we need to do more around how we
> >>> >>> handle
> >>> >>>>>>>>>>>> authentication
> >>> >>>>>>>>>>>>>> failures in clients. We may decide that it is ok to
> close
> >>> >>>>>> the
> >>> >>>>>>>>>>> connection
> >>> >>>>>>>>>>>>>> for authentication and not for re-authentication as you
> >>> >>>>>>>> mentioned,
> >>> >>>>>>>>>>> but
> >>> >>>>>>>>>>>> we
> >>> >>>>>>>>>>>>>> need to change the way this disconnection is handled by
> >>> >>>>>>> clients.
> >>> >>>>>>>> So
> >>> >>>>>>>>>>> IMO,
> >>> >>>>>>>>>>>>>> we
> >>> >>>>>>>>>>>>>> should either add support for transient retriable
> >>> >>>>>>> authentication
> >>> >>>>>>>>>>>> failures
> >>> >>>>>>>>>>>>>> properly or not retry for any scenario. Personally, I
> >>> >>> don't
> >>> >>>>>>> think
> >>> >>>>>>>>> we
> >>> >>>>>>>>>>>> would
> >>> >>>>>>>>>>>>>> want to retry all authentication failures even if it is
> a
> >>> >>>>>>>>>>>>>> re-authentication, I think we could (at some point in
> >>> >>>>>> future),
> >>> >>>>>>>>> allow
> >>> >>>>>>>>>>>>>> brokers to return an error code that indicates that it
> >>> >>> is a
> >>> >>>>>>>>> transient
> >>> >>>>>>>>>>>>>> broker-side failure rather than invalid credentials and
> >>> >>>>>> handle
> >>> >>>>>>>> the
> >>> >>>>>>>>>>> error
> >>> >>>>>>>>>>>>>> differently. I see no reason at that point why we
> >>> >>> wouldn't
> >>> >>>>>>> handle
> >>> >>>>>>>>>>>>>> authentication and re-authentication in the same way.
> >>> >>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>> 4) As you said, the high-level approach would be bigger
> >>> >>>>>> than
> >>> >>>>>>> the
> >>> >>>>>>>>>>>> low-level
> >>> >>>>>>>>>>>>>> approach in terms of LOC. But I wouldn't be too
> concerned
> >>> >>>>>> about
> >>> >>>>>>>>>>> lines of
> >>> >>>>>>>>>>>>>> code. My bigger concern was about modularity. Our
> >>> >>> security
> >>> >>>>>> code
> >>> >>>>>>>> is
> >>> >>>>>>>>>>>> already
> >>> >>>>>>>>>>>>>> complex, protocols like Kerberos and SSL that we use
> from
> >>> >>>>>> the
> >>> >>>>>>> JRE
> >>> >>>>>>>>>>> make
> >>> >>>>>>>>>>>>>> problem diagnosis hard. Async I/O makes the networking
> >>> >>> code
> >>> >>>>>>>>> complex.
> >>> >>>>>>>>>>> You
> >>> >>>>>>>>>>>>>> need to understand networking layer to work with the
> >>> >>>>>> security
> >>> >>>>>>>>> layer,
> >>> >>>>>>>>>>> but
> >>> >>>>>>>>>>>>>> the rest of the code base doesn't rely on knowledge of
> >>> >>>>>>>>>>> network/security
> >>> >>>>>>>>>>>>>> layers. My main concern about the high-level approach is
> >>> >>>>>> that
> >>> >>>>>>> it
> >>> >>>>>>>>>>> spans
> >>> >>>>>>>>>>>>>> these boundaries, making it harder to maintain in the
> >>> >>> long
> >>> >>>>>> run.
> >>> >>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>> On Wed, Sep 12, 2018 at 10:23 AM, Stanislav Kozlovski <
> >>> >>>>>>>>>>>>>> [email protected]> wrote:
> >>> >>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> Hi Ron, Rajini
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> Thanks for summarizing the discussion so far, Ron!
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> 1) How often do we have such long-lived connection
> >>> >>>>>> idleness
> >>> >>>>>>>> (e.g
> >>> >>>>>>>>>>> 5-10
> >>> >>>>>>>>>>>>>>> minutes) in practice?
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> 3) I agree that retries for re-authentication are
> >>> >>> useful.
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> 4) The interleaving of requests sounds like a great
> >>> >>>>>> feature
> >>> >>>>>>> to
> >>> >>>>>>>>>>> have,
> >>> >>>>>>>>>>>> but
> >>> >>>>>>>>>>>>>>> the tradeoff against code complexity is a tough one. I
> >>> >>>>>> would
> >>> >>>>>>>>>>>> personally
> >>> >>>>>>>>>>>>>> go
> >>> >>>>>>>>>>>>>>> with the simpler approach since you could always add
> >>> >>>>>>>> interleaving
> >>> >>>>>>>>>>> on
> >>> >>>>>>>>>>>>>> top if
> >>> >>>>>>>>>>>>>>> the community decides the latency should be better.
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> Best,
> >>> >>>>>>>>>>>>>>> Stanislav
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>> On Tue, Sep 11, 2018 at 5:00 AM Ron Dagostino <
> >>> >>>>>>>> [email protected]
> >>> >>>>>>>>>>
> >>> >>>>>>>>>>>>>> wrote:
> >>> >>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> Hi everyone.  I've updated the PR to reflect the
> >>> >>> latest
> >>> >>>>>>>>>>> conclusions
> >>> >>>>>>>>>>>>>> from
> >>> >>>>>>>>>>>>>>>> this ongoing discussion.  The KIP still needs the
> >>> >>>>>> suggested
> >>> >>>>>>>>>>>> updates; I
> >>> >>>>>>>>>>>>>>> will
> >>> >>>>>>>>>>>>>>>> do that later this week.  II agree with Rajini that
> >>> >>>>>> some
> >>> >>>>>>>>>>> additional
> >>> >>>>>>>>>>>>>>>> feedback from the community at large would be very
> >>> >>>>>> helpful
> >>> >>>>>>> at
> >>> >>>>>>>>>>> this
> >>> >>>>>>>>>>>>>> point
> >>> >>>>>>>>>>>>>>> in
> >>> >>>>>>>>>>>>>>>> time.
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> Here's where we stand.
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> We have 2 separate implementations for
> >>> >>>>>> re-authenticating
> >>> >>>>>>>>> existing
> >>> >>>>>>>>>>>>>>>> connections -- a so-called "low-level" approach and a
> >>> >>>>>>>>> (relatively
> >>> >>>>>>>>>>>>>>> speaking)
> >>> >>>>>>>>>>>>>>>> "high-level" approach -- and we agree on how they
> >>> >>>>>> should be
> >>> >>>>>>>>>>>> compared.
> >>> >>>>>>>>>>>>>>>> Specifically, Rajini has provided a mechanism that
> >>> >>>>>> works
> >>> >>>>>>> at a
> >>> >>>>>>>>>>>>>> relatively
> >>> >>>>>>>>>>>>>>>> low level in the stack by intercepting network sends
> >>> >>>>>> and
> >>> >>>>>>>>> queueing
> >>> >>>>>>>>>>>>>> them up
> >>> >>>>>>>>>>>>>>>> while re-authentication happens; then the queued
> >>> >>> sends
> >>> >>>>>> are
> >>> >>>>>>>> sent
> >>> >>>>>>>>>>>> after
> >>> >>>>>>>>>>>>>>>> re-authentication succeeds, or the connection is
> >>> >>>>>> closed if
> >>> >>>>>>>>>>>>>>>> re-authentication fails.  See the prototype commit at
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> https://github.com/rajinisivaram/kafka/commit/
> >>> >>>>>>> b9d711907ad843
> >>> >>>>>>>>>>>>>>> c11d17e80d6743bfb1d4e3f3fd
> >>> >>>>>>>>>>>>>>>> .
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> I have an implementation that works higher up in the
> >>> >>>>>> stack;
> >>> >>>>>>>> it
> >>> >>>>>>>>>>>> injects
> >>> >>>>>>>>>>>>>>>> authentication requests into the KafkaClient via a
> >>> >>> new
> >>> >>>>>>> method
> >>> >>>>>>>>>>> added
> >>> >>>>>>>>>>>> to
> >>> >>>>>>>>>>>>>>> that
> >>> >>>>>>>>>>>>>>>> interface, and the implementation (i.e.
> >>> >>> NetworkClient)
> >>> >>>>>>> sends
> >>> >>>>>>>>> them
> >>> >>>>>>>>>>>> when
> >>> >>>>>>>>>>>>>>>> poll() is called.  The updated PR is available at
> >>> >>>>>>>>>>>>>>>> https://github.com/apache/kafka/pull/5582/.
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> Here is how we think these two approaches should be
> >>> >>>>>>> compared.
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> 1) *When re-authentication starts*.  The low-level
> >>> >>>>>> approach
> >>> >>>>>>>>>>>> initiates
> >>> >>>>>>>>>>>>>>>> re-authentication only if/when the connection is
> >>> >>>>>> actually
> >>> >>>>>>>> used,
> >>> >>>>>>>>>>> so
> >>> >>>>>>>>>>>> it
> >>> >>>>>>>>>>>>>> may
> >>> >>>>>>>>>>>>>>>> start after the existing credential expires; the
> >>> >>>>>> current PR
> >>> >>>>>>>>>>>>>>> implementation
> >>> >>>>>>>>>>>>>>>> injects re-authentication requests into the existing
> >>> >>>>>> flow,
> >>> >>>>>>>> and
> >>> >>>>>>>>>>>>>>>> re-authentication starts immediately regardless of
> >>> >>>>>> whether
> >>> >>>>>>> or
> >>> >>>>>>>>> not
> >>> >>>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>> connection is being used for something else.  Rajini
> >>> >>>>>>> believes
> >>> >>>>>>>>> the
> >>> >>>>>>>>>>>>>>> low-level
> >>> >>>>>>>>>>>>>>>> approach can be adjusted to re-authenticate idle
> >>> >>>>>>> connections
> >>> >>>>>>>>>>>> (Rajini,
> >>> >>>>>>>>>>>>>> I
> >>> >>>>>>>>>>>>>>>> don't see how this can be done without injecting
> >>> >>>>>> requests,
> >>> >>>>>>>>> which
> >>> >>>>>>>>>>> is
> >>> >>>>>>>>>>>>>> what
> >>> >>>>>>>>>>>>>>>> the high-level connection is doing, but I am probably
> >>> >>>>>>> missing
> >>> >>>>>>>>>>>>>> something
> >>> >>>>>>>>>>>>>>> --
> >>> >>>>>>>>>>>>>>>> no need to go into it unless it is easy to explain.)
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> 2) *When requests not related to re-authentication
> >>> >>> are
> >>> >>>>>> able
> >>> >>>>>>>> to
> >>> >>>>>>>>>>> use
> >>> >>>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>> re-authenticating connection*.  The low-level
> >>> >>> approach
> >>> >>>>>>>> finishes
> >>> >>>>>>>>>>>>>>>> re-authentication completely before allowing anything
> >>> >>>>>> else
> >>> >>>>>>> to
> >>> >>>>>>>>>>>> traverse
> >>> >>>>>>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>> connection, and we agree that this is how the
> >>> >>> low-level
> >>> >>>>>>>>>>>> implementation
> >>> >>>>>>>>>>>>>>> must
> >>> >>>>>>>>>>>>>>>> work without significant work/changes.  The current
> >>> >>> PR
> >>> >>>>>>>>>>>> implementation
> >>> >>>>>>>>>>>>>>>> interleaves re-authentication requests with the
> >>> >>>>>> existing
> >>> >>>>>>> flow
> >>> >>>>>>>>> in
> >>> >>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>> asynchronous I/O uses cases (Consumer, Producer,
> >>> >>> Admin
> >>> >>>>>>>> Client,
> >>> >>>>>>>>>>>> etc.),
> >>> >>>>>>>>>>>>>> and
> >>> >>>>>>>>>>>>>>>> it allows the two synchronous use cases
> >>> >>>>>>>>> (ControllerChannelManager
> >>> >>>>>>>>>>>> and
> >>> >>>>>>>>>>>>>>>> ReplicaFetcherBlockingSend/ReplicaFetcherThread) to
> >>> >>>>>> decide
> >>> >>>>>>>>> which
> >>> >>>>>>>>>>>> style
> >>> >>>>>>>>>>>>>>> they
> >>> >>>>>>>>>>>>>>>> want.  I (somewhat arbitrarily, to prove out the
> >>> >>>>>>> flexibility
> >>> >>>>>>>> of
> >>> >>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>> high-level approach) coded ControllerChannelManager
> >>> >>> to
> >>> >>>>>> do
> >>> >>>>>>>>>>> complete,
> >>> >>>>>>>>>>>>>>>> non-interleaved re-authentication and
> >>> >>>>>>>>> ReplicaFetcherBlockingSend/
> >>> >>>>>>>>>>>>>>>> ReplicaFetcherThread to interleave requests.  The
> >>> >>>>>> approach
> >>> >>>>>>>> has
> >>> >>>>>>>>>>>>>> impacts on
> >>> >>>>>>>>>>>>>>>> the maximum size of latency spikes: interleaving
> >>> >>>>>> requests
> >>> >>>>>>> can
> >>> >>>>>>>>>>>> decrease
> >>> >>>>>>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>> latency.  The benefit of interleaving is tough to
> >>> >>>>>> evaluate
> >>> >>>>>>>>>>> because
> >>> >>>>>>>>>>>> it
> >>> >>>>>>>>>>>>>>> isn't
> >>> >>>>>>>>>>>>>>>> clear what the latency requirement really is.  Note
> >>> >>>>>> that
> >>> >>>>>>>>>>>>>>> re-authentication
> >>> >>>>>>>>>>>>>>>> can entail several network round-trips between the
> >>> >>>>>> client
> >>> >>>>>>> and
> >>> >>>>>>>>> the
> >>> >>>>>>>>>>>>>> broker.
> >>> >>>>>>>>>>>>>>>> Comments in this area would be especially
> >>> >>> appreciated.
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> 3) *What happens if re-authentication fails (i.e.
> >>> >>> retry
> >>> >>>>>>>>>>>>>> capability)*.  I
> >>> >>>>>>>>>>>>>>>> think this is where we have not yet settled on what
> >>> >>> the
> >>> >>>>>>>>>>> requirement
> >>> >>>>>>>>>>>> is
> >>> >>>>>>>>>>>>>>>> (even more so than the issue of potential latency
> >>> >>>>>>> mitigation
> >>> >>>>>>>>>>>>>>>> requirements).  Rajini, you had mentioned that
> >>> >>>>>>>>> re-authentication
> >>> >>>>>>>>>>>>>> should
> >>> >>>>>>>>>>>>>>>> work the same way as authentication, but I see the
> >>> >>> two
> >>> >>>>>>>>>>> situations as
> >>> >>>>>>>>>>>>>>> being
> >>> >>>>>>>>>>>>>>>> asymmetric.  In the authentication case, if
> >>> >>>>>> authentication
> >>> >>>>>>>>> fails,
> >>> >>>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>> connection was never fully established and could not
> >>> >>> be
> >>> >>>>>>> used,
> >>> >>>>>>>>>>> and it
> >>> >>>>>>>>>>>>>> is
> >>> >>>>>>>>>>>>>>>> closed -- TLS negotiation may have taken place, so
> >>> >>> that
> >>> >>>>>>>> effort
> >>> >>>>>>>>> is
> >>> >>>>>>>>>>>>>> lost,
> >>> >>>>>>>>>>>>>>> but
> >>> >>>>>>>>>>>>>>>> beyond that nothing else is lost.  In the
> >>> >>>>>> re-authentication
> >>> >>>>>>>>> case
> >>> >>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>> connection is fully established, it is in use, and in
> >>> >>>>>> fact
> >>> >>>>>>> it
> >>> >>>>>>>>> can
> >>> >>>>>>>>>>>>>> remain
> >>> >>>>>>>>>>>>>>> in
> >>> >>>>>>>>>>>>>>>> use for some time going forward as long as the
> >>> >>> existing
> >>> >>>>>>>>>>> credential
> >>> >>>>>>>>>>>>>>> remains
> >>> >>>>>>>>>>>>>>>> unexpired; abandoning it at this point due to a
> >>> >>>>>>>>> re-authentication
> >>> >>>>>>>>>>>>>> failure
> >>> >>>>>>>>>>>>>>>> (which can occur due to no fault of the client --
> >>> >>> i.e.
> >>> >>>>>> a
> >>> >>>>>>>> remote
> >>> >>>>>>>>>>>>>> directory
> >>> >>>>>>>>>>>>>>>> or OAuth server being temporarily down) abandons a
> >>> >>>>>> fully
> >>> >>>>>>>>>>> functioning
> >>> >>>>>>>>>>>>>>>> connection with remaining lifetime.  Am I thinking
> >>> >>>>>> about
> >>> >>>>>>> this
> >>> >>>>>>>>>>>>>> reasonably?
> >>> >>>>>>>>>>>>>>>> I still tend to believe that retries of failed
> >>> >>>>>>>>> re-authentications
> >>> >>>>>>>>>>>> are
> >>> >>>>>>>>>>>>>> a
> >>> >>>>>>>>>>>>>>>> requirement.
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> 4) *Size*. The low-level approach is smaller in terms
> >>> >>>>>> of
> >>> >>>>>>>> code.
> >>> >>>>>>>>>>> It
> >>> >>>>>>>>>>>> is
> >>> >>>>>>>>>>>>>> a
> >>> >>>>>>>>>>>>>>>> prototype, so it doesn't have everything the
> >>> >>> high-level
> >>> >>>>>>>>> approach
> >>> >>>>>>>>>>>> has,
> >>> >>>>>>>>>>>>>>> but I
> >>> >>>>>>>>>>>>>>>> have split the PR for the high-level approach into
> >>> >>> two
> >>> >>>>>>>> separate
> >>> >>>>>>>>>>>>>> commits
> >>> >>>>>>>>>>>>>>> to
> >>> >>>>>>>>>>>>>>>> facilitate a more apples-to-apples comparison: the
> >>> >>>>>> first
> >>> >>>>>>>> commit
> >>> >>>>>>>>>>>>>> contains
> >>> >>>>>>>>>>>>>>>> infrastructure that would have to be added to the
> >>> >>>>>> low-level
> >>> >>>>>>>>>>>> prototype
> >>> >>>>>>>>>>>>>> if
> >>> >>>>>>>>>>>>>>> we
> >>> >>>>>>>>>>>>>>>> went with that, and the second commit has everything
> >>> >>>>>>> specific
> >>> >>>>>>>>> to
> >>> >>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>> high-level approach.  So comparing the low-level
> >>> >>>>>> prototype
> >>> >>>>>>>>> commit
> >>> >>>>>>>>>>>>>> (~350
> >>> >>>>>>>>>>>>>>>> changes) to the second commit in the high-level PR
> >>> >>>>>> (~1400
> >>> >>>>>>>>>>> changes)
> >>> >>>>>>>>>>>> is
> >>> >>>>>>>>>>>>>>>> reasonable. The high-level approach at this point is
> >>> >>>>>> about
> >>> >>>>>>> 4
> >>> >>>>>>>>>>> times
> >>> >>>>>>>>>>>> as
> >>> >>>>>>>>>>>>>>> big,
> >>> >>>>>>>>>>>>>>>> though it does have lots of error/sanity checking and
> >>> >>>>>>>>>>> functionality
> >>> >>>>>>>>>>>>>> (such
> >>> >>>>>>>>>>>>>>>> as retry) that would to at least some degree have to
> >>> >>> be
> >>> >>>>>>> added
> >>> >>>>>>>>> to
> >>> >>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>> low-level implementation.  Still, the high-level
> >>> >>>>>> approach
> >>> >>>>>>> is
> >>> >>>>>>>>> (and
> >>> >>>>>>>>>>>> will
> >>> >>>>>>>>>>>>>>>> likely remain) somewhat bigger than the low-level
> >>> >>>>>> approach.
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> I hope I presented this in a balanced and accurate
> >>> >>> way;
> >>> >>>>>>>> Rajini,
> >>> >>>>>>>>>>>> please
> >>> >>>>>>>>>>>>>>>> correct me if I got anything wrong or left out
> >>> >>> anything
> >>> >>>>>>>>>>> important.
> >>> >>>>>>>>>>>>>>>> Apologies in advance if that is the case -- it is
> >>> >>>>>>> unintended.
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> Again, comments/discussion from the wider community
> >>> >>> at
> >>> >>>>>> this
> >>> >>>>>>>>>>> juncture
> >>> >>>>>>>>>>>>>>> would
> >>> >>>>>>>>>>>>>>>> be especially appreciated.
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> Ron
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>> On Mon, Sep 10, 2018 at 7:42 AM Rajini Sivaram <
> >>> >>>>>>>>>>>>>> [email protected]>
> >>> >>>>>>>>>>>>>>>> wrote:
> >>> >>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>> Hi Ron,
> >>> >>>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>> Thank you for the analysis. Yes, I agree with the
> >>> >>>>>>>> differences
> >>> >>>>>>>>>>> you
> >>> >>>>>>>>>>>>>> have
> >>> >>>>>>>>>>>>>>>>> pointed out between the two approaches.
> >>> >>>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>>  1. Re-authenticating idle connections (or
> >>> >>>>>> connections
> >>> >>>>>>>>>>> nearing
> >>> >>>>>>>>>>>>>> idle
> >>> >>>>>>>>>>>>>>>>>  timeout): With the lower-level approach, there
> >>> >>> is
> >>> >>>>>>>> nothing
> >>> >>>>>>>>>>>>>> stopping
> >>> >>>>>>>>>>>>>>> us
> >>> >>>>>>>>>>>>>>>>> from
> >>> >>>>>>>>>>>>>>>>>  re-authenticating idle connections. We could
> >>> >>>>>> either
> >>> >>>>>>>>> consider
> >>> >>>>>>>>>>>> idle
> >>> >>>>>>>>>>>>>>>>>  connections for re-authentication or we could
> >>> >>>>>>> implement
> >>> >>>>>>>>>>> code in
> >>> >>>>>>>>>>>>>>> broker
> >>> >>>>>>>>>>>>>>>>> to
> >>> >>>>>>>>>>>>>>>>>  kill connections that deals with delays
> >>> >>>>>> introduced by
> >>> >>>>>>>>>>> idleness.
> >>> >>>>>>>>>>>>>>> Since
> >>> >>>>>>>>>>>>>>>> we
> >>> >>>>>>>>>>>>>>>>>  expect token lifetimes to be much higher than
> >>> >>>>>>> connection
> >>> >>>>>>>>>>> expiry
> >>> >>>>>>>>>>>>>>> time,
> >>> >>>>>>>>>>>>>>>>>  either would work. And the second approach is
> >>> >>>>>> probably
> >>> >>>>>>>>>>> better.
> >>> >>>>>>>>>>>>>>>>>  2. Interleaving authentication requests and
> >>> >>>>>>> application
> >>> >>>>>>>>>>>> requests
> >>> >>>>>>>>>>>>>> to
> >>> >>>>>>>>>>>>>>>>>  reduce impact on latency: It is not impossible
> >>> >>> to
> >>> >>>>>>>>> interleave
> >>> >>>>>>>>>>>> with
> >>> >>>>>>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>>>  low-level approach, but it adds complexity and
> >>> >>>>>> reduces
> >>> >>>>>>>> the
> >>> >>>>>>>>>>>> value
> >>> >>>>>>>>>>>>>> of
> >>> >>>>>>>>>>>>>>>> this
> >>> >>>>>>>>>>>>>>>>>  approach. So it will be good to understand the
> >>> >>>>>>>>> requirements
> >>> >>>>>>>>>>> in
> >>> >>>>>>>>>>>>>> terms
> >>> >>>>>>>>>>>>>>>> of
> >>> >>>>>>>>>>>>>>>>>  latency. In terms of comparing the two
> >>> >>>>>> approaches, it
> >>> >>>>>>> is
> >>> >>>>>>>>>>> fair
> >>> >>>>>>>>>>>> to
> >>> >>>>>>>>>>>>>>>> compare
> >>> >>>>>>>>>>>>>>>>>  non-interleaved low-level with the naturally
> >>> >>>>>>> interleaved
> >>> >>>>>>>>>>>>>> high-level
> >>> >>>>>>>>>>>>>>>>>  approach.
> >>> >>>>>>>>>>>>>>>>>  3. Retries: I don't think requirement for
> >>> >>> retries
> >>> >>>>>> is
> >>> >>>>>>>>>>> different
> >>> >>>>>>>>>>>>>>> between
> >>> >>>>>>>>>>>>>>>>>  authentication and re-authentication from a
> >>> >>>>>> client's
> >>> >>>>>>>> point
> >>> >>>>>>>>>>> of
> >>> >>>>>>>>>>>>>> view.
> >>> >>>>>>>>>>>>>>> We
> >>> >>>>>>>>>>>>>>>>>  should either support retries for both or not at
> >>> >>>>>> all.
> >>> >>>>>>> We
> >>> >>>>>>>>>>>>>> currently
> >>> >>>>>>>>>>>>>>>>> process
> >>> >>>>>>>>>>>>>>>>>  authentication failures as fatal errors. We
> >>> >>> expect
> >>> >>>>>>> that
> >>> >>>>>>>> if
> >>> >>>>>>>>>>> you
> >>> >>>>>>>>>>>>>> are
> >>> >>>>>>>>>>>>>>>> using
> >>> >>>>>>>>>>>>>>>>>  external authentication servers, those servers
> >>> >>>>>> will be
> >>> >>>>>>>>>>> highly
> >>> >>>>>>>>>>>>>>>> available.
> >>> >>>>>>>>>>>>>>>>>  Our built-in authentication mechanisms avoid
> >>> >>>>>> accessing
> >>> >>>>>>>>>>> external
> >>> >>>>>>>>>>>>>>>> servers
> >>> >>>>>>>>>>>>>>>>> in
> >>> >>>>>>>>>>>>>>>>>  the authentication path (during authentication,
> >>> >>>>>>>>>>> verification is
> >>> >>>>>>>>>>>>>>>>> performed
> >>> >>>>>>>>>>>>>>>>>  using private/public keys or cache lookup). This
> >>> >>>>>> is
> >>> >>>>>>>>>>> necessary
> >>> >>>>>>>>>>>> in
> >>> >>>>>>>>>>>>>> our
> >>> >>>>>>>>>>>>>>>>> design
> >>> >>>>>>>>>>>>>>>>>  since authentications are performed
> >>> >>> synchronously
> >>> >>>>>> on
> >>> >>>>>>> the
> >>> >>>>>>>>>>>> network
> >>> >>>>>>>>>>>>>>>> thread.
> >>> >>>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>> One more point with the lower-level mechanism: If
> >>> >>>>>> one day
> >>> >>>>>>>> we
> >>> >>>>>>>>>>>> decided
> >>> >>>>>>>>>>>>>>> that
> >>> >>>>>>>>>>>>>>>>> we want to extend this for SSL, it would be
> >>> >>> possible
> >>> >>>>>> to
> >>> >>>>>>>>> change
> >>> >>>>>>>>>>>> just
> >>> >>>>>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>>> transport layer and make it work for SSL. We may
> >>> >>>>>> never
> >>> >>>>>>> want
> >>> >>>>>>>>> to
> >>> >>>>>>>>>>> do
> >>> >>>>>>>>>>>>>> this,
> >>> >>>>>>>>>>>>>>>> but
> >>> >>>>>>>>>>>>>>>>> thought I would point out anyway.
> >>> >>>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>> I think the next step would be to add the low-level
> >>> >>>>>>>> approach
> >>> >>>>>>>>> to
> >>> >>>>>>>>>>>> the
> >>> >>>>>>>>>>>>>> KIP
> >>> >>>>>>>>>>>>>>>>> under `Rejected Alternatives` with details on the
> >>> >>>>>>>>> differences.
> >>> >>>>>>>>>>> And
> >>> >>>>>>>>>>>>>> more
> >>> >>>>>>>>>>>>>>>>> detail on requirements covering latency and retries
> >>> >>>>>> in
> >>> >>>>>>> the
> >>> >>>>>>>>>>> body of
> >>> >>>>>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>>> KIP.  For "Rejected Alternatives", can we add
> >>> >>>>>>> sub-sections
> >>> >>>>>>>> (I
> >>> >>>>>>>>>>>> think
> >>> >>>>>>>>>>>>>>> there
> >>> >>>>>>>>>>>>>>>>> are two approaches there now, but you need to read
> >>> >>>>>>> through
> >>> >>>>>>>> to
> >>> >>>>>>>>>>>> figure
> >>> >>>>>>>>>>>>>>> that
> >>> >>>>>>>>>>>>>>>>> out, so sub-titles would help).
> >>> >>>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>> Once the KIP is updated, it will be good to get
> >>> >>> more
> >>> >>>>>>>> feedback
> >>> >>>>>>>>>>> from
> >>> >>>>>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>>> community to decide on the approach to adopt.
> >>> >>>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>> On Sat, Sep 8, 2018 at 1:27 PM, Ron Dagostino <
> >>> >>>>>>>>>>> [email protected]>
> >>> >>>>>>>>>>>>>>> wrote:
> >>> >>>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>>> Hi yet again, Rajini.  If we decide that
> >>> >>>>>> interleaving
> >>> >>>>>>> is
> >>> >>>>>>>>>>>>>> impossible
> >>> >>>>>>>>>>>>>>>> with
> >>> >>>>>>>>>>>>>>>>>> the low-level approach but we still want to at
> >>> >>>>>> least
> >>> >>>>>>>>> support
> >>> >>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>>>> possibility given that it reduces the size of
> >>> >>>>>> latency
> >>> >>>>>>>>> spikes,
> >>> >>>>>>>>>>>>>> then I
> >>> >>>>>>>>>>>>>>> do
> >>> >>>>>>>>>>>>>>>>>> have a suggestion.  I documented in the KIP how I
> >>> >>>>>>>> rejected
> >>> >>>>>>>>> a
> >>> >>>>>>>>>>>>>> single,
> >>> >>>>>>>>>>>>>>>>>> one-size-fits-all queue approach because we don't
> >>> >>>>>> know
> >>> >>>>>>>> when
> >>> >>>>>>>>>>>> poll()
> >>> >>>>>>>>>>>>>>> will
> >>> >>>>>>>>>>>>>>>>> be
> >>> >>>>>>>>>>>>>>>>>> called in the synchronous I/O use case.  An
> >>> >>>>>> advantage
> >>> >>>>>>> of
> >>> >>>>>>>>>>> such an
> >>> >>>>>>>>>>>>>>>> approach
> >>> >>>>>>>>>>>>>>>>>> -- if it would have worked, which it didn't -- is
> >>> >>>>>> that
> >>> >>>>>>> it
> >>> >>>>>>>>>>> would
> >>> >>>>>>>>>>>>>> have
> >>> >>>>>>>>>>>>>>>> been
> >>> >>>>>>>>>>>>>>>>>> transparent to the owner of the NetworkClient
> >>> >>>>>> instance
> >>> >>>>>>>>> (this
> >>> >>>>>>>>>>> is
> >>> >>>>>>>>>>>>>> one
> >>> >>>>>>>>>>>>>>> of
> >>> >>>>>>>>>>>>>>>>> the
> >>> >>>>>>>>>>>>>>>>>> big advantages of the low-level approach, after
> >>> >>>>>> all).
> >>> >>>>>>>> But
> >>> >>>>>>>>> we
> >>> >>>>>>>>>>>>>> could
> >>> >>>>>>>>>>>>>>>> make
> >>> >>>>>>>>>>>>>>>>>> the one-size-fits-all queue approach work if we
> >>> >>> are
> >>> >>>>>>>> willing
> >>> >>>>>>>>>>> to
> >>> >>>>>>>>>>>>>> impose
> >>> >>>>>>>>>>>>>>>>> some
> >>> >>>>>>>>>>>>>>>>>> requirements on synchronous I/O users of
> >>> >>>>>> NetworkClient.
> >>> >>>>>>>>>>>>>>> Specifically,
> >>> >>>>>>>>>>>>>>>> we
> >>> >>>>>>>>>>>>>>>>>> could add a method to the
> >>> >>>>>>>>>>> org.apache.kafka.clients.KafkaClient
> >>> >>>>>>>>>>>>>>>> interface
> >>> >>>>>>>>>>>>>>>>>> (which NetworkClient implements) as follows:
> >>> >>>>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>>>   /**
> >>> >>>>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>>>    * Return true if any node has
> >>> >>>>>> re-authentication
> >>> >>>>>>>>> requests
> >>> >>>>>>>>>>>>>> waiting
> >>> >>>>>>>>>>>>>>>> to
> >>> >>>>>>>>>>>>>>>>> be
> >>> >>>>>>>>>>>>>>>>>> sent. A
> >>> >>>>>>>>>>>>>>>>>>
> >>> >>>>>>>>>>>>>>>>>>    * call to {@link #poll(long, long)} is
> >>> >>>>>> required to
> >>> >>>>>>>>> send
> >>> >>>>>>>>>>>> such
> >>> >>>>>>>>>>>>>>>>> requests.
> >>> >>>>>>>>>>>>>>>>>> An owner
> >>> >>>>>>>>>>>>>>>>>>
> >>> >
> >>
> >>
>

Re: [DISCUSS] KIP 368: Allow SASL Connections to Periodically Re-Authenticate

Reply via email to