Re: [DISCUSS] KIP 368: Allow SASL Connections to Periodically Re-Authenticate

Rajini Sivaram Fri, 14 Sep 2018 08:59:52 -0700

It will be good to shorten the config name. We have a config named `
connections.max.idle.ms`. We could add something like `
connections.max.expired.credentials.ms`. As you said, it would be prefixed
with listener and SASL mechanism name. We should be able to support
connection termination in future even with SSL, so perhaps we don't want
the `sasl.server.` prefix (we just need to validate whether the config is
supported for the protocol). You can choose whether a boolean flag or a
time interval makes more sense.


For the client-side, the KIP explains how to make it work for other
mechanisms and we can leave it that. Not convinced about server-side
though. It seems to me that we probably would require API changes to make
it work. Basically the proposed approach works only for OAUTHBEARER. Since
we have to bump up SaslHandshakeRequest version in this KIP, it will be
good to work out if we need to change the request or flow to make this work
for other mechanisms. I haven't figured out exactly what is needed, but
will think about it and get back next week. In the meantime, you can get
the KIP up-to-date with the config, migration path etc. and get it ready to
start vote next week.





On Fri, Sep 14, 2018 at 1:24 PM, Ron Dagostino <[email protected]> wrote:

> Hi Rajini.
>
> Agreed, we will bump the SaslHandshakeRequest API number (no wire format
> change, of course).
>
> Agreed, responding to re-authentication will always be enabled on the
> broker since it is initiated by the client.
>
> Agreed, connection termination by the broker will be opt-in.  I'm thinking
> *sasl.server.disconnect.expired.credential.connections.enable*, prefixed
> with listener and SASL mechanism name in lower-case so it can be opt-in at
> the granularity of a SASL mechanism; for example, "
> listener.name.sasl_ssl.oauthbearer.sasl.server.
> disconnect.expired.credential.connections.enable=true
> ".  It is long, so if you have a preference for a shorter name let me know
> and we'll go with it instead.
>
> Agreed, additional metrics will be helpful.  I'll add them to the KIP.
>
> <<<need to work out exactly how this works with all SASL mechanisms
> <<<not sure we have the data required on the broker or a way of
> <<< extending the [GSSAPI] mechanism
> Not sure I agree here.  The paragraph I added to the KIP describes how I
> think this can be done.  Given that description, do you still feel Kerberos
> will not be possible?  If Kerberos presents a significant hurdle then I
> don't think that should prevent us from doing it for other mechanisms -- I
> would rather state that it isn't supported with GSSAPI due to limitations
> in Kerberos itself than not have it for OAUTHBEARER, for example.  And
> right now I don't think we need more than a high-level description of how
> this could be done.  In other words, I think we have this covered, unless
> there is a fundamental problem due to Kerberos not making data (i.e. ticket
> expiration) available on the server.   I want to submit this KIP for a vote
> early next week in the hope of having it accepted by the Monday, 9/24
> deadline for the 2.1.0 release, and if that happens I believe I can get a
> PR into really good shape soon thereafter (I'm working on it now).
>
> Ron
>
>
>
>
>
>
> On Fri, Sep 14, 2018 at 5:57 AM Rajini Sivaram <[email protected]>
> wrote:
>
> > Hi Ron,
> >
> > 1) Yes, we should update the version of `SaslHandshakeRequest`. Clients
> > would attempt to re-authenticate only if broker's SaslHandshakeRequest
> > version >=2.
> > 2) I think we should enable broker's re-authentication and connection
> > termination code regardless of client version. Re-authentication could be
> > always enabled since it is triggered only by clients and broker could
> just
> > handle it. Connection termination should be opt-in, but should work with
> > clients of any version. If turned on and clients are of an older version,
> > this would simply force reconnection, which should be ok. Additional
> > metrics to monitor expired connections may be useful anyway.
> >
> > We also need to work out exactly how this works with all SASL mechanisms.
> > In particular, we have ensure that we can make it work with Kerberos
> since
> > we use the implementation from the JRE and I am not sure we have the data
> > required on the broker or a way of extending the mechanism.
> >
> > Thoughts?
> >
> > On Thu, Sep 13, 2018 at 6:56 PM, Ron Dagostino <[email protected]>
> wrote:
> >
> > > Hi Rajini.  I'm thinking about how we deal with migration.   For
> example,
> > > let's say we have an existing 2.0.0 cluster using SASL/OAUTHBEARER and
> we
> > > want to use this feature.  The desired end state is to have all clients
> > and
> > > brokers migrated to a version that supports the feature (call it 2.x)
> > with
> > > the feature turned on.  We need to document the supported path(s) to
> this
> > > end state.
> > >
> > > Here are a couple of scenarios with implications:
> > >
> > > 1) *Migrate client to 2.x and turn the feature on but broker is still
> at
> > > 2.0.*  In this case the client is going to get an error when it sends
> the
> > > SaslHandshakeRequest.  We want to avoid this.  It seems to me the
> client
> > > needs to know if the broker has been upgraded to the necessary version
> > > before trying to re-authenticate, which makes me believe we need to
> bump
> > > the API version number in 2.x and the client has to check to see if the
> > max
> > > version the broker supports meets a minimum standard before trying to
> > > re-authenticate.  Do you agree?
> > >
> > > 2) *Migrate broker to 2.x and turn the feature on but client is still
> at
> > > 2.0*.  In this case the broker is going to end up killing connections
> > > periodically.   Again, we want to avoid this.  It's tempting to say
> > "don't
> > > do this" but I wonder if we can require installations to upgrade all
> > > clients before turning the feature on at the brokers.  Certainly we can
> > say
> > > "don't do this" for inter-broker clients -- in other words, we can say
> > that
> > > all brokers in a cluster should be upgraded before the feature is
> turned
> > on
> > > for any one of them -- but I don't know about our ability to say it for
> > > non-broker clients.
> > >
> > > Now we consider the cases where both the brokers and the clients have
> > been
> > > upgraded to 2.x.  When and where should the feature be turned on?  The
> > > inter-broker case is interesting because I don't think think we can
> > require
> > > installations to restart every broker with a new config where the
> feature
> > > is turned on at the same time.  Do you agree that we cannot require
> this
> > > and therefore must support installations restarting brokers with a new
> > > config at whatever pace they need -- which may be quite slow relative
> to
> > > token lifetimes?  Assuming you do agree, then there is going to be the
> > case
> > > where some brokers are going to have the feature turned on and some
> > clients
> > > (definitely inter-broker clients at a minimum) are going to have the
> > > feature turned off.
> > >
> > > The above discussion assumes a single config on the broker side that
> > turns
> > > on both the inter-broker client re-authentication feature as well as
> the
> > > server-side expired-connection-kill feature, but now I'm thinking we
> need
> > > the ability to turn these features on independently, plus maybe we
> need a
> > > way to monitor the number of "active, expired" connections to the
> server
> > so
> > > that operators can be sure that all clients have been upgraded/enabled
> > > before turning on the server-side expired-connection-kill feature.
> > >
> > > So the migration plan would be as follows:
> > >
> > > 1) Upgrade all brokers to 2.x.
> > > 2) After all brokers are upgraded, turn on re-authentication for all
> > > brokers at whatever rate is desired -- just eventually, at some point,
> > get
> > > the client-side feature turned on for all brokers so that inter-broker
> > > connections are re-authenticating.
> > > 3) In parallel with (1) and (2) above, upgrade clients to 2.x and turn
> > > their re-authentication feature on.  Clients will check the API version
> > and
> > > only re-authenticate to a broker that has also been upgraded (note that
> > the
> > > ability of a broker to respond to a re-authentication cannot be turned
> > off
> > > -- it is always on beginning with version 2.x, and it just sits there
> > doing
> > > nothing if it isn't exercised by an enabled client)
> > > 4) After (1), (2), and (3) are complete, check the 2.x broker metrics
> to
> > > confirm that there are no "active, expired" connections.  Once you are
> > > satisfied that (1), (2), and (3) are indeed complete you can enable the
> > > server-side expired-connection-kill feature on each broker via a
> restart
> > at
> > > whatever pace is desired.
> > >
> > > Comments?
> > >
> > > Ron
> > >
> > >
> > > On Wed, Sep 12, 2018 at 4:48 PM Ron Dagostino <[email protected]>
> wrote:
> > >
> > > > Ok, I am tempted to just say we go with the low-level approach since
> it
> > > is
> > > > the quickest and seems to meet the clear requirements. We can always
> > > adjust
> > > > later if we get to clarity on other requirements or we decide we need
> > to
> > > > approach it differently for whatever reason.  But in the meantime,
> > before
> > > > fully committing to this decision, I would appreciate another
> > perspective
> > > > if someone has one.
> > > >
> > > > Ron
> > > >
> > > > On Wed, Sep 12, 2018 at 3:15 PM Rajini Sivaram <
> > [email protected]>
> > > > wrote:
> > > >
> > > >> Hi Ron,
> > > >>
> > > >> Yes, I would leave out retries from this KIP for now. In the future,
> > if
> > > >> there is a requirement for supporting retries, we can consider it. I
> > > think
> > > >> we can support retries with either approach if we needed to, but it
> > > would
> > > >> be better to do it along with other changes required to support
> > > >> authentication servers that are not highly available.
> > > >>
> > > >> For maintainability, I am biased, so it will be good to get the
> > > >> perspective
> > > >> of others in the community :-)
> > > >>
> > > >> On Wed, Sep 12, 2018 at 7:47 PM, Ron Dagostino <[email protected]>
> > > wrote:
> > > >>
> > > >> > Hi Rajini.  Here is some feedback/some things I thought of.
> > > >> >
> > > >> > First, I just realized that from a timing perspective that I am
> not
> > > sure
> > > >> > retry is going to be necessary.  The background login refresh
> thread
> > > >> > triggers re-authentication when it refreshes the client's
> > credential.
> > > >> The
> > > >> > OAuth infrastructure has to be available in order for this refresh
> > to
> > > >> > succeed (the background thread repeatedly retries if it can't
> > refresh
> > > >> the
> > > >> > credential, and that retry loop handles any temporary outage).  If
> > > >> clients
> > > >> > are told to re-authenticate when the credential is refreshed **and
> > > they
> > > >> > actually re-authenticate at that point** then it is highly
> unlikely
> > > that
> > > >> > the OAuth infrastructure would fail within those intervening
> > > >> milliseconds.
> > > >> > So we don't need re-authentication retry in this KIP as long as
> > retry
> > > >> > starts immediately.  The low-level prototype as currently coded
> > > doesn't
> > > >> > actually start re-authentication until the connection is
> > subsequently
> > > >> used,
> > > >> > and that could take a while.  But then again, if the connection
> > isn't
> > > >> used
> > > >> > for some period of time, then the lost value of having to abandon
> > the
> > > >> > connection is lessened anyway.  Plus, as you pointed out, OAuth
> > > >> > infrastructure is assumed to be highly available.
> > > >> >
> > > >> > Does this makes sense, and would you be willing to say that retry
> > > isn't
> > > >> a
> > > >> > necessary requirement?  I'm tempted but would like to hear your
> > > >> perspective
> > > >> > first.
> > > >> >
> > > >> > Interleaving/latency and maintainability (more than lines of code)
> > are
> > > >> the
> > > >> > two remaining areas of comparison.  I did not add the
> > maintainability
> > > >> issue
> > > >> > to the KIP yet, but before I do I thought I would address it here
> > > first
> > > >> to
> > > >> > see if we can come to consensus on it because I'm not sure I see
> the
> > > >> > high-level approach as being hard to maintain (yet -- I could be
> > > >> > convinced/convince myself; we'll see).  I do want to make sure we
> > are
> > > on
> > > >> > the same page about what is required to add re-authentication
> > support
> > > in
> > > >> > the high-level case.  Granted, the amount is zero in the low-level
> > > case,
> > > >> > and it doesn't get any better that that, but the amount in the
> > > >> high-level
> > > >> > case is very low -- just a few lines of code.  For example:
> > > >> >
> > > >> > KafkaAdminClient:
> > > >> > https://github.com/apache/kafka/pull/5582/commits/4fa70f38b9
> > > >> > d33428ff98b64a3a2bd668f5f28c38#diff-
> 6869b8fccf6b098cbcb0676e8ceb26a7
> > > >> > It is the same few lines of code for KafkaConsumer, KafkaProducer,
> > > >> > WorkerGroupMember, and TransactionMarkerChannelManager
> > > >> >
> > > >> > The two synchronous I/O use cases are ControllerChannelManager and
> > > >> > ReplicaFetcherBlockingSend (via ReplicaFetcherThread), and they
> > > require
> > > >> a
> > > >> > little bit more -- but not much.
> > > >> >
> > > >> > Thoughts?
> > > >> >
> > > >> > Ron
> > > >> >
> > > >> > On Wed, Sep 12, 2018 at 1:57 PM Ron Dagostino <[email protected]>
> > > >> wrote:
> > > >> >
> > > >> > > Thanks, Rajini.  Before I digest/respond to that, here's an
> update
> > > >> that I
> > > >> > > just completed.
> > > >> > >
> > > >> > > I added a commit to the PR (
> > > >> https://github.com/apache/kafka/pull/5582/)
> > > >> > > that implements server-side kill of expired OAUTHBEARER
> > connections.
> > > >> No
> > > >> > > tests yet since we still haven't settled on a final approach
> > > >> (low-level
> > > >> > vs.
> > > >> > > high-level).
> > > >> > >
> > > >> > > I also updated the KIP to reflect the latest discussion and PR
> as
> > > >> > follows:
> > > >> > >
> > > >> > >    - Include support for brokers killing connections as part of
> > this
> > > >> KIP
> > > >> > >    (rather than deferring it to a future KIP as was originally
> > > >> > mentioned; the
> > > >> > >    PR now includes it as mentioned -- it was very easy to add)
> > > >> > >    - Added metrics (they will mirror existing ones related to
> > > >> > >    authentications; I have not added those to the PR)
> > > >> > >    - Updated the implementation description to reflect the
> current
> > > >> state
> > > >> > >    of the PR, which is a high-level, one-size-fits-all approach
> > (as
> > > >> > opposed to
> > > >> > >    my initial, even-higher-level approach)
> > > >> > >    - Added a "Rejected Alternative" for the first version of the
> > PR,
> > > >> > >    which injected requests directly into synchronous I/O
> clients'
> > > >> queues
> > > >> > >    - Added a "Rejected Alternative" for the low-level approach
> as
> > > >> > >    suggested, but of course we have not formally decided to
> reject
> > > >> this
> > > >> > >    approach or adopt the current PR implementation.
> > > >> > >
> > > >> > > I'll think about where we stand some more before responding
> again.
> > > >> > Thanks
> > > >> > > for the above reply.
> > > >> > >
> > > >> > > Ron
> > > >> > >
> > > >> > > On Wed, Sep 12, 2018 at 1:36 PM Rajini Sivaram <
> > > >> [email protected]>
> > > >> > > wrote:
> > > >> > >
> > > >> > >> Hi Ron,
> > > >> > >>
> > > >> > >> Thank you for summarising, I think it covers the differences
> > > between
> > > >> the
> > > >> > >> two approaches well.
> > > >> > >>
> > > >> > >> A few minor points to answer the questions in there:
> > > >> > >>
> > > >> > >> 1) When re-authetication is initiated in the Selector during
> > > poll(),
> > > >> we
> > > >> > >> can
> > > >> > >> move an idle channel to re-authentication state. It is similar
> to
> > > >> > >> injecting
> > > >> > >> requests, but achieved by changing channel back to
> authenticating
> > > >> state.
> > > >> > >>
> > > >> > >> 3) To clarify why I think re-authentication should fit in with
> > our
> > > >> > >> authentication design: My point was not about a specific
> > connection
> > > >> > being
> > > >> > >> usable or not usable. It was about what happens at the client
> API
> > > >> level.
> > > >> > >> Our client API (producer/consumer/admin client etc.) currently
> > > assume
> > > >> > that
> > > >> > >> a single broker authentication failure is a fatal error that is
> > > never
> > > >> > >> retried because we assume that broker only ever fails an
> > > >> authentication
> > > >> > >> request if credentials are invalid. If we ever decide to
> support
> > > >> cases
> > > >> > >> where broker occasionally fails an authentication request due
> to
> > a
> > > >> > >> transient failure, we need to do more around how we handle
> > > >> > authentication
> > > >> > >> failures in clients. We may decide that it is ok to close the
> > > >> connection
> > > >> > >> for authentication and not for re-authentication as you
> > mentioned,
> > > >> but
> > > >> > we
> > > >> > >> need to change the way this disconnection is handled by
> clients.
> > So
> > > >> IMO,
> > > >> > >> we
> > > >> > >> should either add support for transient retriable
> authentication
> > > >> > failures
> > > >> > >> properly or not retry for any scenario. Personally, I don't
> think
> > > we
> > > >> > would
> > > >> > >> want to retry all authentication failures even if it is a
> > > >> > >> re-authentication, I think we could (at some point in future),
> > > allow
> > > >> > >> brokers to return an error code that indicates that it is a
> > > transient
> > > >> > >> broker-side failure rather than invalid credentials and handle
> > the
> > > >> error
> > > >> > >> differently. I see no reason at that point why we wouldn't
> handle
> > > >> > >> authentication and re-authentication in the same way.
> > > >> > >>
> > > >> > >> 4) As you said, the high-level approach would be bigger than
> the
> > > >> > low-level
> > > >> > >> approach in terms of LOC. But I wouldn't be too concerned about
> > > >> lines of
> > > >> > >> code. My bigger concern was about modularity. Our security code
> > is
> > > >> > already
> > > >> > >> complex, protocols like Kerberos and SSL that we use from the
> JRE
> > > >> make
> > > >> > >> problem diagnosis hard. Async I/O makes the networking code
> > > complex.
> > > >> You
> > > >> > >> need to understand networking layer to work with the security
> > > layer,
> > > >> but
> > > >> > >> the rest of the code base doesn't rely on knowledge of
> > > >> network/security
> > > >> > >> layers. My main concern about the high-level approach is that
> it
> > > >> spans
> > > >> > >> these boundaries, making it harder to maintain in the long run.
> > > >> > >>
> > > >> > >>
> > > >> > >> On Wed, Sep 12, 2018 at 10:23 AM, Stanislav Kozlovski <
> > > >> > >> [email protected]> wrote:
> > > >> > >>
> > > >> > >> > Hi Ron, Rajini
> > > >> > >> >
> > > >> > >> > Thanks for summarizing the discussion so far, Ron!
> > > >> > >> >
> > > >> > >> > 1) How often do we have such long-lived connection idleness
> > (e.g
> > > >> 5-10
> > > >> > >> > minutes) in practice?
> > > >> > >> >
> > > >> > >> > 3) I agree that retries for re-authentication are useful.
> > > >> > >> >
> > > >> > >> > 4) The interleaving of requests sounds like a great feature
> to
> > > >> have,
> > > >> > but
> > > >> > >> > the tradeoff against code complexity is a tough one. I would
> > > >> > personally
> > > >> > >> go
> > > >> > >> > with the simpler approach since you could always add
> > interleaving
> > > >> on
> > > >> > >> top if
> > > >> > >> > the community decides the latency should be better.
> > > >> > >> >
> > > >> > >> > Best,
> > > >> > >> > Stanislav
> > > >> > >> >
> > > >> > >> > On Tue, Sep 11, 2018 at 5:00 AM Ron Dagostino <
> > [email protected]
> > > >
> > > >> > >> wrote:
> > > >> > >> >
> > > >> > >> > > Hi everyone.  I've updated the PR to reflect the latest
> > > >> conclusions
> > > >> > >> from
> > > >> > >> > > this ongoing discussion.  The KIP still needs the suggested
> > > >> > updates; I
> > > >> > >> > will
> > > >> > >> > > do that later this week.  II agree with Rajini that some
> > > >> additional
> > > >> > >> > > feedback from the community at large would be very helpful
> at
> > > >> this
> > > >> > >> point
> > > >> > >> > in
> > > >> > >> > > time.
> > > >> > >> > >
> > > >> > >> > > Here's where we stand.
> > > >> > >> > >
> > > >> > >> > > We have 2 separate implementations for re-authenticating
> > > existing
> > > >> > >> > > connections -- a so-called "low-level" approach and a
> > > (relatively
> > > >> > >> > speaking)
> > > >> > >> > > "high-level" approach -- and we agree on how they should be
> > > >> > compared.
> > > >> > >> > > Specifically, Rajini has provided a mechanism that works
> at a
> > > >> > >> relatively
> > > >> > >> > > low level in the stack by intercepting network sends and
> > > queueing
> > > >> > >> them up
> > > >> > >> > > while re-authentication happens; then the queued sends are
> > sent
> > > >> > after
> > > >> > >> > > re-authentication succeeds, or the connection is closed if
> > > >> > >> > > re-authentication fails.  See the prototype commit at
> > > >> > >> > >
> > > >> > >> > > https://github.com/rajinisivaram/kafka/commit/
> b9d711907ad843
> > > >> > >> > c11d17e80d6743bfb1d4e3f3fd
> > > >> > >> > > .
> > > >> > >> > >
> > > >> > >> > > I have an implementation that works higher up in the stack;
> > it
> > > >> > injects
> > > >> > >> > > authentication requests into the KafkaClient via a new
> method
> > > >> added
> > > >> > to
> > > >> > >> > that
> > > >> > >> > > interface, and the implementation (i.e. NetworkClient)
> sends
> > > them
> > > >> > when
> > > >> > >> > > poll() is called.  The updated PR is available at
> > > >> > >> > > https://github.com/apache/kafka/pull/5582/.
> > > >> > >> > >
> > > >> > >> > > Here is how we think these two approaches should be
> compared.
> > > >> > >> > >
> > > >> > >> > > 1) *When re-authentication starts*.  The low-level approach
> > > >> > initiates
> > > >> > >> > > re-authentication only if/when the connection is actually
> > used,
> > > >> so
> > > >> > it
> > > >> > >> may
> > > >> > >> > > start after the existing credential expires; the current PR
> > > >> > >> > implementation
> > > >> > >> > > injects re-authentication requests into the existing flow,
> > and
> > > >> > >> > > re-authentication starts immediately regardless of whether
> or
> > > not
> > > >> > the
> > > >> > >> > > connection is being used for something else.  Rajini
> believes
> > > the
> > > >> > >> > low-level
> > > >> > >> > > approach can be adjusted to re-authenticate idle
> connections
> > > >> > (Rajini,
> > > >> > >> I
> > > >> > >> > > don't see how this can be done without injecting requests,
> > > which
> > > >> is
> > > >> > >> what
> > > >> > >> > > the high-level connection is doing, but I am probably
> missing
> > > >> > >> something
> > > >> > >> > --
> > > >> > >> > > no need to go into it unless it is easy to explain.)
> > > >> > >> > >
> > > >> > >> > > 2) *When requests not related to re-authentication are able
> > to
> > > >> use
> > > >> > the
> > > >> > >> > > re-authenticating connection*.  The low-level approach
> > finishes
> > > >> > >> > > re-authentication completely before allowing anything else
> to
> > > >> > traverse
> > > >> > >> > the
> > > >> > >> > > connection, and we agree that this is how the low-level
> > > >> > implementation
> > > >> > >> > must
> > > >> > >> > > work without significant work/changes.  The current PR
> > > >> > implementation
> > > >> > >> > > interleaves re-authentication requests with the existing
> flow
> > > in
> > > >> the
> > > >> > >> > > asynchronous I/O uses cases (Consumer, Producer, Admin
> > Client,
> > > >> > etc.),
> > > >> > >> and
> > > >> > >> > > it allows the two synchronous use cases
> > > (ControllerChannelManager
> > > >> > and
> > > >> > >> > > ReplicaFetcherBlockingSend/ReplicaFetcherThread) to decide
> > > which
> > > >> > style
> > > >> > >> > they
> > > >> > >> > > want.  I (somewhat arbitrarily, to prove out the
> flexibility
> > of
> > > >> the
> > > >> > >> > > high-level approach) coded ControllerChannelManager to do
> > > >> complete,
> > > >> > >> > > non-interleaved re-authentication and
> > > ReplicaFetcherBlockingSend/
> > > >> > >> > > ReplicaFetcherThread to interleave requests.  The approach
> > has
> > > >> > >> impacts on
> > > >> > >> > > the maximum size of latency spikes: interleaving requests
> can
> > > >> > decrease
> > > >> > >> > the
> > > >> > >> > > latency.  The benefit of interleaving is tough to evaluate
> > > >> because
> > > >> > it
> > > >> > >> > isn't
> > > >> > >> > > clear what the latency requirement really is.  Note that
> > > >> > >> > re-authentication
> > > >> > >> > > can entail several network round-trips between the client
> and
> > > the
> > > >> > >> broker.
> > > >> > >> > > Comments in this area would be especially appreciated.
> > > >> > >> > >
> > > >> > >> > > 3) *What happens if re-authentication fails (i.e. retry
> > > >> > >> capability)*.  I
> > > >> > >> > > think this is where we have not yet settled on what the
> > > >> requirement
> > > >> > is
> > > >> > >> > > (even more so than the issue of potential latency
> mitigation
> > > >> > >> > > requirements).  Rajini, you had mentioned that
> > > re-authentication
> > > >> > >> should
> > > >> > >> > > work the same way as authentication, but I see the two
> > > >> situations as
> > > >> > >> > being
> > > >> > >> > > asymmetric.  In the authentication case, if authentication
> > > fails,
> > > >> > the
> > > >> > >> > > connection was never fully established and could not be
> used,
> > > >> and it
> > > >> > >> is
> > > >> > >> > > closed -- TLS negotiation may have taken place, so that
> > effort
> > > is
> > > >> > >> lost,
> > > >> > >> > but
> > > >> > >> > > beyond that nothing else is lost.  In the re-authentication
> > > case
> > > >> the
> > > >> > >> > > connection is fully established, it is in use, and in fact
> it
> > > can
> > > >> > >> remain
> > > >> > >> > in
> > > >> > >> > > use for some time going forward as long as the existing
> > > >> credential
> > > >> > >> > remains
> > > >> > >> > > unexpired; abandoning it at this point due to a
> > > re-authentication
> > > >> > >> failure
> > > >> > >> > > (which can occur due to no fault of the client -- i.e. a
> > remote
> > > >> > >> directory
> > > >> > >> > > or OAuth server being temporarily down) abandons a fully
> > > >> functioning
> > > >> > >> > > connection with remaining lifetime.  Am I thinking about
> this
> > > >> > >> reasonably?
> > > >> > >> > > I still tend to believe that retries of failed
> > > re-authentications
> > > >> > are
> > > >> > >> a
> > > >> > >> > > requirement.
> > > >> > >> > >
> > > >> > >> > > 4) *Size*. The low-level approach is smaller in terms of
> > code.
> > > >> It
> > > >> > is
> > > >> > >> a
> > > >> > >> > > prototype, so it doesn't have everything the high-level
> > > approach
> > > >> > has,
> > > >> > >> > but I
> > > >> > >> > > have split the PR for the high-level approach into two
> > separate
> > > >> > >> commits
> > > >> > >> > to
> > > >> > >> > > facilitate a more apples-to-apples comparison: the first
> > commit
> > > >> > >> contains
> > > >> > >> > > infrastructure that would have to be added to the low-level
> > > >> > prototype
> > > >> > >> if
> > > >> > >> > we
> > > >> > >> > > went with that, and the second commit has everything
> specific
> > > to
> > > >> the
> > > >> > >> > > high-level approach.  So comparing the low-level prototype
> > > commit
> > > >> > >> (~350
> > > >> > >> > > changes) to the second commit in the high-level PR (~1400
> > > >> changes)
> > > >> > is
> > > >> > >> > > reasonable. The high-level approach at this point is about
> 4
> > > >> times
> > > >> > as
> > > >> > >> > big,
> > > >> > >> > > though it does have lots of error/sanity checking and
> > > >> functionality
> > > >> > >> (such
> > > >> > >> > > as retry) that would to at least some degree have to be
> added
> > > to
> > > >> the
> > > >> > >> > > low-level implementation.  Still, the high-level approach
> is
> > > (and
> > > >> > will
> > > >> > >> > > likely remain) somewhat bigger than the low-level approach.
> > > >> > >> > >
> > > >> > >> > > I hope I presented this in a balanced and accurate way;
> > Rajini,
> > > >> > please
> > > >> > >> > > correct me if I got anything wrong or left out anything
> > > >> important.
> > > >> > >> > > Apologies in advance if that is the case -- it is
> unintended.
> > > >> > >> > >
> > > >> > >> > > Again, comments/discussion from the wider community at this
> > > >> juncture
> > > >> > >> > would
> > > >> > >> > > be especially appreciated.
> > > >> > >> > >
> > > >> > >> > > Ron
> > > >> > >> > >
> > > >> > >> > >
> > > >> > >> > >
> > > >> > >> > >
> > > >> > >> > > On Mon, Sep 10, 2018 at 7:42 AM Rajini Sivaram <
> > > >> > >> [email protected]>
> > > >> > >> > > wrote:
> > > >> > >> > >
> > > >> > >> > > > Hi Ron,
> > > >> > >> > > >
> > > >> > >> > > > Thank you for the analysis. Yes, I agree with the
> > differences
> > > >> you
> > > >> > >> have
> > > >> > >> > > > pointed out between the two approaches.
> > > >> > >> > > >
> > > >> > >> > > >    1. Re-authenticating idle connections (or connections
> > > >> nearing
> > > >> > >> idle
> > > >> > >> > > >    timeout): With the lower-level approach, there is
> > nothing
> > > >> > >> stopping
> > > >> > >> > us
> > > >> > >> > > > from
> > > >> > >> > > >    re-authenticating idle connections. We could either
> > > consider
> > > >> > idle
> > > >> > >> > > >    connections for re-authentication or we could
> implement
> > > >> code in
> > > >> > >> > broker
> > > >> > >> > > > to
> > > >> > >> > > >    kill connections that deals with delays introduced by
> > > >> idleness.
> > > >> > >> > Since
> > > >> > >> > > we
> > > >> > >> > > >    expect token lifetimes to be much higher than
> connection
> > > >> expiry
> > > >> > >> > time,
> > > >> > >> > > >    either would work. And the second approach is probably
> > > >> better.
> > > >> > >> > > >    2. Interleaving authentication requests and
> application
> > > >> > requests
> > > >> > >> to
> > > >> > >> > > >    reduce impact on latency: It is not impossible to
> > > interleave
> > > >> > with
> > > >> > >> > the
> > > >> > >> > > >    low-level approach, but it adds complexity and reduces
> > the
> > > >> > value
> > > >> > >> of
> > > >> > >> > > this
> > > >> > >> > > >    approach. So it will be good to understand the
> > > requirements
> > > >> in
> > > >> > >> terms
> > > >> > >> > > of
> > > >> > >> > > >    latency. In terms of comparing the two approaches, it
> is
> > > >> fair
> > > >> > to
> > > >> > >> > > compare
> > > >> > >> > > >    non-interleaved low-level with the naturally
> interleaved
> > > >> > >> high-level
> > > >> > >> > > >    approach.
> > > >> > >> > > >    3. Retries: I don't think requirement for retries is
> > > >> different
> > > >> > >> > between
> > > >> > >> > > >    authentication and re-authentication from a client's
> > point
> > > >> of
> > > >> > >> view.
> > > >> > >> > We
> > > >> > >> > > >    should either support retries for both or not at all.
> We
> > > >> > >> currently
> > > >> > >> > > > process
> > > >> > >> > > >    authentication failures as fatal errors. We expect
> that
> > if
> > > >> you
> > > >> > >> are
> > > >> > >> > > using
> > > >> > >> > > >    external authentication servers, those servers will be
> > > >> highly
> > > >> > >> > > available.
> > > >> > >> > > >    Our built-in authentication mechanisms avoid accessing
> > > >> external
> > > >> > >> > > servers
> > > >> > >> > > > in
> > > >> > >> > > >    the authentication path (during authentication,
> > > >> verification is
> > > >> > >> > > > performed
> > > >> > >> > > >    using private/public keys or cache lookup). This is
> > > >> necessary
> > > >> > in
> > > >> > >> our
> > > >> > >> > > > design
> > > >> > >> > > >    since authentications are performed synchronously on
> the
> > > >> > network
> > > >> > >> > > thread.
> > > >> > >> > > >
> > > >> > >> > > > One more point with the lower-level mechanism: If one day
> > we
> > > >> > decided
> > > >> > >> > that
> > > >> > >> > > > we want to extend this for SSL, it would be possible to
> > > change
> > > >> > just
> > > >> > >> the
> > > >> > >> > > > transport layer and make it work for SSL. We may never
> want
> > > to
> > > >> do
> > > >> > >> this,
> > > >> > >> > > but
> > > >> > >> > > > thought I would point out anyway.
> > > >> > >> > > >
> > > >> > >> > > > I think the next step would be to add the low-level
> > approach
> > > to
> > > >> > the
> > > >> > >> KIP
> > > >> > >> > > > under `Rejected Alternatives` with details on the
> > > differences.
> > > >> And
> > > >> > >> more
> > > >> > >> > > > detail on requirements covering latency and retries in
> the
> > > >> body of
> > > >> > >> the
> > > >> > >> > > > KIP.  For "Rejected Alternatives", can we add
> sub-sections
> > (I
> > > >> > think
> > > >> > >> > there
> > > >> > >> > > > are two approaches there now, but you need to read
> through
> > to
> > > >> > figure
> > > >> > >> > that
> > > >> > >> > > > out, so sub-titles would help).
> > > >> > >> > > >
> > > >> > >> > > > Once the KIP is updated, it will be good to get more
> > feedback
> > > >> from
> > > >> > >> the
> > > >> > >> > > > community to decide on the approach to adopt.
> > > >> > >> > > >
> > > >> > >> > > >
> > > >> > >> > > > On Sat, Sep 8, 2018 at 1:27 PM, Ron Dagostino <
> > > >> [email protected]>
> > > >> > >> > wrote:
> > > >> > >> > > >
> > > >> > >> > > > > Hi yet again, Rajini.  If we decide that interleaving
> is
> > > >> > >> impossible
> > > >> > >> > > with
> > > >> > >> > > > > the low-level approach but we still want to at least
> > > support
> > > >> the
> > > >> > >> > > > > possibility given that it reduces the size of latency
> > > spikes,
> > > >> > >> then I
> > > >> > >> > do
> > > >> > >> > > > > have a suggestion.  I documented in the KIP how I
> > rejected
> > > a
> > > >> > >> single,
> > > >> > >> > > > > one-size-fits-all queue approach because we don't know
> > when
> > > >> > poll()
> > > >> > >> > will
> > > >> > >> > > > be
> > > >> > >> > > > > called in the synchronous I/O use case.  An advantage
> of
> > > >> such an
> > > >> > >> > > approach
> > > >> > >> > > > > -- if it would have worked, which it didn't -- is that
> it
> > > >> would
> > > >> > >> have
> > > >> > >> > > been
> > > >> > >> > > > > transparent to the owner of the NetworkClient instance
> > > (this
> > > >> is
> > > >> > >> one
> > > >> > >> > of
> > > >> > >> > > > the
> > > >> > >> > > > > big advantages of the low-level approach, after all).
> > But
> > > we
> > > >> > >> could
> > > >> > >> > > make
> > > >> > >> > > > > the one-size-fits-all queue approach work if we are
> > willing
> > > >> to
> > > >> > >> impose
> > > >> > >> > > > some
> > > >> > >> > > > > requirements on synchronous I/O users of NetworkClient.
> > > >> > >> > Specifically,
> > > >> > >> > > we
> > > >> > >> > > > > could add a method to the
> > > >> org.apache.kafka.clients.KafkaClient
> > > >> > >> > > interface
> > > >> > >> > > > > (which NetworkClient implements) as follows:
> > > >> > >> > > > >
> > > >> > >> > > > >     /**
> > > >> > >> > > > >
> > > >> > >> > > > >      * Return true if any node has re-authentication
> > > requests
> > > >> > >> waiting
> > > >> > >> > > to
> > > >> > >> > > > be
> > > >> > >> > > > > sent. A
> > > >> > >> > > > >
> > > >> > >> > > > >      * call to {@link #poll(long, long)} is required to
> > > send
> > > >> > such
> > > >> > >> > > > requests.
> > > >> > >> > > > > An owner
> > > >> > >> > > > >
> > > >> > >> > > > >      * of this instance that does not implement a run
> > loop
> > > to
> > > >> > >> > > repeatedly
> > > >> > >> > > > > call
> > > >> > >> > > > >
> > > >> > >> > > > >      * {@link #poll(long, long)} but instead only sends
> > > >> requests
> > > >> > >> > > > > synchronously
> > > >> > >> > > > >
> > > >> > >> > > > >      * on-demand must call this method periodically --
> > and
> > > >> > invoke
> > > >> > >> > > > >
> > > >> > >> > > > >      * {@link #poll(long, long)} if the return value is
> > > >> {@code
> > > >> > >> true}
> > > >> > >> > --
> > > >> > >> > > > to
> > > >> > >> > > > > ensure
> > > >> > >> > > > >
> > > >> > >> > > > >      * that any re-authentication requests that have
> ben
> > > >> > injected
> > > >> > >> are
> > > >> > >> > > > sent
> > > >> > >> > > > > in a
> > > >> > >> > > > >
> > > >> > >> > > > >      * timely fashion.
> > > >> > >> > > > >
> > > >> > >> > > > >      *
> > > >> > >> > > > >
> > > >> > >> > > > >      * @return true if any node has re-authentication
> > > >> requests
> > > >> > >> > waiting
> > > >> > >> > > to
> > > >> > >> > > > > be sent
> > > >> > >> > > > >
> > > >> > >> > > > >      */
> > > >> > >> > > > >
> > > >> > >> > > > >     default boolean hasReauthenticationRequests() {
> > > >> > >> > > > >
> > > >> > >> > > > >         return false;
> > > >> > >> > > > >
> > > >> > >> > > > >     }
> > > >> > >> > > > >
> > > >> > >> > > > > If we are willing to add an additional method like this
> > and
> > > >> make
> > > >> > >> the
> > > >> > >> > > > minor
> > > >> > >> > > > > adjustments to the code associated with the few
> > synchronous
> > > >> use
> > > >> > >> cases
> > > >> > >> > > > then
> > > >> > >> > > > > I think the high-level approach will work.  We would
> then
> > > >> have
> > > >> > the
> > > >> > >> > > > > possibility of further parameterizing the various
> > > synchronous
> > > >> > I/O
> > > >> > >> use
> > > >> > >> > > > > cases.  For example, there are currently 3:
> > > >> > >> > > > >
> > > >> > >> > > > > ControllerChannelManager
> > > >> > >> > > > > KafkaProducer, via Sender
> > > >> > >> > > > > ReplicaFetcherBlockingSend, via ReplicaFetcherThread
> > > >> > >> > > > >
> > > >> > >> > > > > Is it conceivable that one of these use cases might be
> > more
> > > >> > >> > > > > latency-sensitive than others and might desire
> > > >> interleaving?  A
> > > >> > >> > > > high-level
> > > >> > >> > > > > approach would give us the option of configuring each
> use
> > > >> case
> > > >> > >> > > > accordingly.
> > > >> > >> > > > >
> > > >> > >> > > > > Ron
> > > >> > >> > > > >
> > > >> > >> > > > > On Fri, Sep 7, 2018 at 10:50 PM Ron Dagostino <
> > > >> > [email protected]>
> > > >> > >> > > wrote:
> > > >> > >> > > > >
> > > >> > >> > > > > > Hi again, Rajini.  It occurs to me that from a
> > *behavior*
> > > >> > >> > perspective
> > > >> > >> > > > > > there are really 3 fundamental differences between
> the
> > > >> > low-level
> > > >> > >> > > > approach
> > > >> > >> > > > > > you provided and the high-level approach as it
> > currently
> > > >> > exists
> > > >> > >> in
> > > >> > >> > > the
> > > >> > >> > > > > PR:
> > > >> > >> > > > > >
> > > >> > >> > > > > > 1) *When re-authentication starts*.  The low-level
> > > approach
> > > >> > >> > initiates
> > > >> > >> > > > > > re-authentication only if/when the connection is
> > actually
> > > >> > used,
> > > >> > >> so
> > > >> > >> > it
> > > >> > >> > > > may
> > > >> > >> > > > > > start after the existing credential expires; the
> > current
> > > PR
> > > >> > >> > > > > implementation
> > > >> > >> > > > > > injects re-authentication requests into the existing
> > > flow,
> > > >> and
> > > >> > >> > > > > > re-authentication starts immediately regardless of
> > > whether
> > > >> or
> > > >> > >> not
> > > >> > >> > the
> > > >> > >> > > > > > connection is being used for something else.
> > > >> > >> > > > > >
> > > >> > >> > > > > > 2) *When requests not related to re-authentication
> can
> > > use
> > > >> the
> > > >> > >> > > > > > re-authenticating connection*.  The low-level
> approach
> > > >> > finishes
> > > >> > >> > > > > > re-authentication completely before allowing anything
> > > else
> > > >> to
> > > >> > >> > travers
> > > >> > >> > > > the
> > > >> > >> > > > > > connection; the current PR implementation interleaves
> > > >> > >> > > re-authentication
> > > >> > >> > > > > > requests with existing flow requests.
> > > >> > >> > > > > >
> > > >> > >> > > > > > 3) *What happens if re-authentication fails*.  The
> > > >> low-level
> > > >> > >> > approach
> > > >> > >> > > > > > results in a closed connection and does not support
> > > >> retries --
> > > >> > >> at
> > > >> > >> > > least
> > > >> > >> > > > > as
> > > >> > >> > > > > > currently implemented; the current PR implementation
> > > >> supports
> > > >> > >> > > retries.
> > > >> > >> > > > > >
> > > >> > >> > > > > > Do you agree that these are the (only?) behavioral
> > > >> > differences?
> > > >> > >> > > > > >
> > > >> > >> > > > > > For these facets of behavior, I wonder what the
> > > >> requirements
> > > >> > are
> > > >> > >> > for
> > > >> > >> > > > this
> > > >> > >> > > > > > feature.  I think they are as follows:
> > > >> > >> > > > > >
> > > >> > >> > > > > > 1) *When re-authentication starts*: I don't think we
> > > have a
> > > >> > hard
> > > >> > >> > > > > > requirement here when considered in isolation --
> > whether
> > > >> > >> > > > > re-authentication
> > > >> > >> > > > > > starts immediately or is delayed until the connection
> > is
> > > >> used
> > > >> > >> > > probably
> > > >> > >> > > > > > doesn't matter.
> > > >> > >> > > > > >
> > > >> > >> > > > > > 2) *When requests not related to re-authentication
> can
> > > use
> > > >> the
> > > >> > >> > > > > > re-authenticating connection*: there is a tradeoff
> here
> > > >> > between
> > > >> > >> > > latency
> > > >> > >> > > > > > and ability to debug re-authentication problems.
> > > Delaying
> > > >> use
> > > >> > >> of
> > > >> > >> > the
> > > >> > >> > > > > > connection until re-authentication finishes results
> in
> > > >> bigger
> > > >> > >> > latency
> > > >> > >> > > > > > spikes but keeps re-authentication requests somewhat
> > > >> together;
> > > >> > >> > > > > interleaving
> > > >> > >> > > > > > minimizes the size of individual latency spikes but
> > adds
> > > >> some
> > > >> > >> > > > separation
> > > >> > >> > > > > > between the requests.
> > > >> > >> > > > > >
> > > >> > >> > > > > > 3) *What happens if re-authentication fails*: I
> believe
> > > we
> > > >> > have
> > > >> > >> a
> > > >> > >> > > clear
> > > >> > >> > > > > > requirement for retry capability given what I have
> > > written
> > > >> > >> > > previously.
> > > >> > >> > > > > Do
> > > >> > >> > > > > > you agree?  Note that while the current low-level
> > > >> > implementation
> > > >> > >> > does
> > > >> > >> > > > not
> > > >> > >> > > > > > support retry, I have been thinking about how that
> can
> > be
> > > >> > done,
> > > >> > >> > and I
> > > >> > >> > > > am
> > > >> > >> > > > > > pretty sure it can be.  We can keep the old
> > > authenticators
> > > >> on
> > > >> > >> the
> > > >> > >> > > > client
> > > >> > >> > > > > > and server sides and put them back into place if the
> > > >> > >> > > re-authentication
> > > >> > >> > > > > > fails; we would also have to make sure the server
> side
> > > >> doesn't
> > > >> > >> > delay
> > > >> > >> > > > any
> > > >> > >> > > > > > failed re-authentication result and also doesn't
> close
> > > the
> > > >> > >> > connection
> > > >> > >> > > > > upon
> > > >> > >> > > > > > re-authentication failure.  I think I see how to do
> all
> > > of
> > > >> > that.
> > > >> > >> > > > > >
> > > >> > >> > > > > > There are some interactions between the above
> > > requirements.
> > > >> > If
> > > >> > >> > > > > > re-authentication can't start immediately and has to
> > wait
> > > >> for
> > > >> > >> the
> > > >> > >> > > > > > connection to be used then that precludes
> interleaving
> > > >> because
> > > >> > >> we
> > > >> > >> > > can't
> > > >> > >> > > > > be
> > > >> > >> > > > > > sure that the credential will be active by the time
> it
> > is
> > > >> used
> > > >> > >> --
> > > >> > >> > if
> > > >> > >> > > it
> > > >> > >> > > > > > isn't active, and we allow interleaving, then
> requests
> > > not
> > > >> > >> related
> > > >> > >> > to
> > > >> > >> > > > > > re-authentication will fail if server-side
> > > >> > >> > > > > > connection-close-due-to-expired-credential
> > > functionality is
> > > >> > in
> > > >> > >> > place.
> > > >> > >> > > > > Also
> > > >> > >> > > > > > note that any such server-side
> connection-close-due-to-
> > > >> > >> > > > > expired-credential
> > > >> > >> > > > > > functionality would likely have to avoid closing a
> > > >> connection
> > > >> > >> until
> > > >> > >> > > it
> > > >> > >> > > > is
> > > >> > >> > > > > > used for anything other than re-authentication -- it
> > must
> > > >> > allow
> > > >> > >> > > > > > re-authentication requests through when the
> credential
> > is
> > > >> > >> expired.
> > > >> > >> > > > > >
> > > >> > >> > > > > > Given all of the above, it feels to me like the
> > low-level
> > > >> > >> solution
> > > >> > >> > is
> > > >> > >> > > > > > viable only under the following conditions:
> > > >> > >> > > > > >
> > > >> > >> > > > > > 1) We must accept a delay in when re-authentication
> > > occurs
> > > >> to
> > > >> > >> when
> > > >> > >> > > the
> > > >> > >> > > > > > connection is used
> > > >> > >> > > > > > 2) We must accept the bigger latency spikes
> associated
> > > with
> > > >> > not
> > > >> > >> > > > > > interleaving requests
> > > >> > >> > > > > >
> > > >> > >> > > > > > Does this sound right to you, and if so, do you find
> > > these
> > > >> > >> > conditions
> > > >> > >> > > > > > acceptable?  Or have I missed something and/or made
> > > >> incorrect
> > > >> > >> > > > assumptions
> > > >> > >> > > > > > somewhere?
> > > >> > >> > > > > >
> > > >> > >> > > > > > Ron
> > > >> > >> > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > > > On Fri, Sep 7, 2018 at 5:19 PM Ron Dagostino <
> > > >> > [email protected]
> > > >> > >> >
> > > >> > >> > > > wrote:
> > > >> > >> > > > > >
> > > >> > >> > > > > >> Hi Rajini.  The code really helped me to understand
> > what
> > > >> you
> > > >> > >> were
> > > >> > >> > > > > >> thinking -- thanks.  I'm still digesting, but here
> are
> > > >> some
> > > >> > >> quick
> > > >> > >> > > > > >> observations:
> > > >> > >> > > > > >>
> > > >> > >> > > > > >> Your approach (I'll call it the "low-level"
> approach,
> > as
> > > >> > >> compared
> > > >> > >> > to
> > > >> > >> > > > the
> > > >> > >> > > > > >> existing PR, which works at a higher level of
> > > >> abstraction) --
> > > >> > >> the
> > > >> > >> > > > > low-level
> > > >> > >> > > > > >> approach is certainly intriguing.  The smaller code
> > size
> > > >> is
> > > >> > >> > welcome,
> > > >> > >> > > > of
> > > >> > >> > > > > >> course, as is the fact that re-authentication simply
> > > works
> > > >> > for
> > > >> > >> > > > everyone
> > > >> > >> > > > > >> regardless of the style of use (async vs. sync I/O).
> > > >> > >> > > > > >>
> > > >> > >> > > > > >> I did notice that re-authentication of the
> connection
> > > >> starts
> > > >> > >> only
> > > >> > >> > > > > if/when
> > > >> > >> > > > > >> the client uses the connection.  For example, if I
> > run a
> > > >> > >> console
> > > >> > >> > > > > producer,
> > > >> > >> > > > > >> re-authentication does not happen unless I try to
> > > produce
> > > >> > >> > something.
> > > >> > >> > > > On
> > > >> > >> > > > > >> the one hand this is potentially good -- if the
> client
> > > >> isn't
> > > >> > >> using
> > > >> > >> > > the
> > > >> > >> > > > > >> connection then the connection stays "silent" and
> > could
> > > be
> > > >> > >> closed
> > > >> > >> > on
> > > >> > >> > > > the
> > > >> > >> > > > > >> server side if it stays idle long enough -- whereas
> if
> > > we
> > > >> > start
> > > >> > >> > > > > injecting
> > > >> > >> > > > > >> re-authentication requests (as is done in the
> > high-level
> > > >> > >> approach)
> > > >> > >> > > > then
> > > >> > >> > > > > the
> > > >> > >> > > > > >> connection never goes completely silent and could
> (?)
> > > >> > >> potentially
> > > >> > >> > > > avoid
> > > >> > >> > > > > >> being closed due to idleness.
> > > >> > >> > > > > >>
> > > >> > >> > > > > >> However, if we implement sever-side kill of
> > connections
> > > >> using
> > > >> > >> > > expired
> > > >> > >> > > > > >> credentials (which we agree is needed), then we
> might
> > > end
> > > >> up
> > > >> > >> with
> > > >> > >> > > the
> > > >> > >> > > > > >> broker killing connections that are sitting idle for
> > > only
> > > >> a
> > > >> > >> short
> > > >> > >> > > > > period of
> > > >> > >> > > > > >> time.  For example, if we refresh the token on the
> > > client
> > > >> > side
> > > >> > >> and
> > > >> > >> > > > tell
> > > >> > >> > > > > the
> > > >> > >> > > > > >> connection that it is eligible to be
> re-authenticated,
> > > >> then
> > > >> > it
> > > >> > >> is
> > > >> > >> > > > > >> conceivable that the connection might be sitting
> idle
> > at
> > > >> that
> > > >> > >> > point
> > > >> > >> > > > and
> > > >> > >> > > > > >> might not be used until after the token it is
> > currently
> > > >> using
> > > >> > >> > > expires.
> > > >> > >> > > > > The
> > > >> > >> > > > > >> server might kill the connection, and that would
> force
> > > the
> > > >> > >> client
> > > >> > >> > to
> > > >> > >> > > > > >> re-connect with a new connection (requiring TLS
> > > >> negotiation).
> > > >> > >> The
> > > >> > >> > > > > >> probability of this happening increases as the token
> > > >> lifetime
> > > >> > >> > > > > decreases, of
> > > >> > >> > > > > >> course, and it can be offset by decreasing the
> window
> > > >> factor
> > > >> > >> (i.e.
> > > >> > >> > > > make
> > > >> > >> > > > > it
> > > >> > >> > > > > >> eligible for re-authenticate at 50% of the lifetime
> > > rather
> > > >> > than
> > > >> > >> > 80%,
> > > >> > >> > > > for
> > > >> > >> > > > > >> example -- it would have to sit idle for longer
> before
> > > the
> > > >> > >> server
> > > >> > >> > > > tried
> > > >> > >> > > > > to
> > > >> > >> > > > > >> kill it).  We haven't implemented server-side kill
> > yet,
> > > so
> > > >> > >> maybe
> > > >> > >> > we
> > > >> > >> > > > > could
> > > >> > >> > > > > >> make it intelligent and only kill the connection if
> it
> > > is
> > > >> > used
> > > >> > >> > (for
> > > >> > >> > > > > >> anything except re-authentication) after the
> > expiration
> > > >> > time...
> > > >> > >> > > > > >>
> > > >> > >> > > > > >> I also wonder about the ability to add retry into
> the
> > > >> > low-level
> > > >> > >> > > > > >> approach.  Do you think it would be possible?  It
> > > doesn't
> > > >> > seem
> > > >> > >> > like
> > > >> > >> > > it
> > > >> > >> > > > > to
> > > >> > >> > > > > >> me -- at least not without some big changes that
> would
> > > >> > >> eliminate
> > > >> > >> > > some
> > > >> > >> > > > of
> > > >> > >> > > > > >> the advantage of being able to reuse the existing
> > > >> > >> authentication
> > > >> > >> > > code.
> > > >> > >> > > > > The
> > > >> > >> > > > > >> reason I ask is because I think retry is necessary.
> > It
> > > is
> > > >> > >> part of
> > > >> > >> > > how
> > > >> > >> > > > > >> refreshes work for both GSSAPI and OAUTHBEARER --
> they
> > > >> > refresh
> > > >> > >> > based
> > > >> > >> > > > on
> > > >> > >> > > > > >> some window factor (i.e. 80%) and implement periodic
> > > retry
> > > >> > upon
> > > >> > >> > > > failure
> > > >> > >> > > > > so
> > > >> > >> > > > > >> that they can maximize the chances of having a new
> > > >> credential
> > > >> > >> > > > available
> > > >> > >> > > > > for
> > > >> > >> > > > > >> any new connection attempt.  Without refresh we
> could
> > > end
> > > >> up
> > > >> > in
> > > >> > >> > the
> > > >> > >> > > > > >> situation where the connection still has some
> lifetime
> > > >> left
> > > >> > >> (10%,
> > > >> > >> > > 20%,
> > > >> > >> > > > > or
> > > >> > >> > > > > >> whatever) but it tries to re-authenticate and cannot
> > > >> through
> > > >> > no
> > > >> > >> > > fault
> > > >> > >> > > > of
> > > >> > >> > > > > >> its own (i.e. token endpoint down, some Kerberos
> > > failure,
> > > >> > etc.)
> > > >> > >> > -=-
> > > >> > >> > > > the
> > > >> > >> > > > > >> connection is closed at that point, and it is then
> > > unable
> > > >> to
> > > >> > >> > > reconnect
> > > >> > >> > > > > >> because of the same temporary problem.  We could end
> > up
> > > >> with
> > > >> > an
> > > >> > >> > > > > especially
> > > >> > >> > > > > >> ill-timed, temporary outage in some non-Kafka system
> > > >> (related
> > > >> > >> to
> > > >> > >> > > OAuth
> > > >> > >> > > > > or
> > > >> > >> > > > > >> Kerberos, or some LDAP directory) causing all
> clients
> > to
> > > >> be
> > > >> > >> kicked
> > > >> > >> > > off
> > > >> > >> > > > > the
> > > >> > >> > > > > >> cluster.  Retry capability seems to me to be the way
> > to
> > > >> > >> mitigate
> > > >> > >> > > this
> > > >> > >> > > > > risk.
> > > >> > >> > > > > >>
> > > >> > >> > > > > >> Anyway, that's it for now.  I really like the
> approach
> > > you
> > > >> > >> > outlined
> > > >> > >> > > --
> > > >> > >> > > > > at
> > > >> > >> > > > > >> least at this point based on my current
> understanding.
> > > I
> > > >> > will
> > > >> > >> > > > continue
> > > >> > >> > > > > to
> > > >> > >> > > > > >> dig in, and I may send more comments/questions.  But
> > for
> > > >> > now, I
> > > >> > >> > > think
> > > >> > >> > > > > the
> > > >> > >> > > > > >> lack of retry -- and my definitely-could-be-wrong
> > sense
> > > >> that
> > > >> > it
> > > >> > >> > > can't
> > > >> > >> > > > > >> easily be added -- is my biggest concern with this
> > > >> low-level
> > > >> > >> > > approach.
> > > >> > >> > > > > >>
> > > >> > >> > > > > >> Ron
> > > >> > >> > > > > >>
> > > >> > >> > > > > >> On Thu, Sep 6, 2018 at 4:57 PM Rajini Sivaram <
> > > >> > >> > > > [email protected]>
> > > >> > >> > > > > >> wrote:
> > > >> > >> > > > > >>
> > > >> > >> > > > > >>> Hi Ron,
> > > >> > >> > > > > >>>
> > > >> > >> > > > > >>> The commit
> > > >> > >> > > > > >>>
> > > >> > >> > > > > >>> https://github.com/rajinisivaram/kafka/commit/
> > > >> > >> > > > > b9d711907ad843c11d17e80d6743bfb1d4e3f3fd
> > > >> > >> > > > > >>> shows
> > > >> > >> > > > > >>> the kind of flow I was thinking of. It is just a
> > > >> prototype
> > > >> > >> with a
> > > >> > >> > > > fixed
> > > >> > >> > > > > >>> re-authentication period to explore the possibility
> > of
> > > >> > >> > implementing
> > > >> > >> > > > > >>> re-authentication similar to authentication. There
> > will
> > > >> be
> > > >> > >> edge
> > > >> > >> > > cases
> > > >> > >> > > > > to
> > > >> > >> > > > > >>> cover and errors to handle, but hopefully the code
> > > makes
> > > >> the
> > > >> > >> > > approach
> > > >> > >> > > > > >>> clearer than my earlier explanation!
> > > >> > >> > > > > >>>
> > > >> > >> > > > > >>> So the differences in the two implementations as
> you
> > > have
> > > >> > >> already
> > > >> > >> > > > > >>> mentioned
> > > >> > >> > > > > >>> earlier.
> > > >> > >> > > > > >>>
> > > >> > >> > > > > >>>    1. Re-authentication sequences are not
> interleaved
> > > >> with
> > > >> > >> Kafka
> > > >> > >> > > > > >>> requests.
> > > >> > >> > > > > >>>    As you said, this has a higher impact on
> latency.
> > > IMO,
> > > >> > this
> > > >> > >> > also
> > > >> > >> > > > > >>> makes it
> > > >> > >> > > > > >>>    easier to debug, especially with complex
> protocols
> > > >> like
> > > >> > >> > Kerberos
> > > >> > >> > > > > >>> which are
> > > >> > >> > > > > >>>    notoriously difficult to diagnose.
> > > >> > >> > > > > >>>    2. Re-authentication failures will not be
> retried,
> > > >> they
> > > >> > >> will
> > > >> > >> > be
> > > >> > >> > > > > >>> treated
> > > >> > >> > > > > >>>    as fatal errors similar to authentication
> > failures.
> > > >> IMO,
> > > >> > >> since
> > > >> > >> > > we
> > > >> > >> > > > > >>> rely on
> > > >> > >> > > > > >>>    brokers never rejecting valid authentication
> > > requests
> > > >> > >> (clients
> > > >> > >> > > > treat
> > > >> > >> > > > > >>>    authentication failures as fatal errors), it is
> ok
> > > to
> > > >> > fail
> > > >> > >> on
> > > >> > >> > > > > >>>    re-authentication failure as well.
> > > >> > >> > > > > >>>    3. Re-authentication is performed on the network
> > > >> thread
> > > >> > on
> > > >> > >> > > brokers
> > > >> > >> > > > > >>>    similar to authentication (in your
> > implementation, I
> > > >> > think
> > > >> > >> it
> > > >> > >> > > > would
> > > >> > >> > > > > >>> be on
> > > >> > >> > > > > >>>    the request thread). IMO, it is better do all
> > > >> > >> authentications
> > > >> > >> > > > using
> > > >> > >> > > > > >>> the
> > > >> > >> > > > > >>>    same thread pool.
> > > >> > >> > > > > >>>    4. Code complexity: I don't think actual code
> > > >> complexity
> > > >> > >> would
> > > >> > >> > > be
> > > >> > >> > > > > very
> > > >> > >> > > > > >>>    different between the two implementations. But I
> > > think
> > > >> > >> there
> > > >> > >> > is
> > > >> > >> > > > > value
> > > >> > >> > > > > >>> in
> > > >> > >> > > > > >>>    keeping re-authentication code within existing
> > > >> > >> > network/security
> > > >> > >> > > > > >>> layers. The
> > > >> > >> > > > > >>>    number of classes modified will be smaller and
> the
> > > >> number
> > > >> > >> of
> > > >> > >> > > > > packages
> > > >> > >> > > > > >>>    modified even smaller.
> > > >> > >> > > > > >>>
> > > >> > >> > > > > >>> Anyway, let me know what you think:
> > > >> > >> > > > > >>>
> > > >> > >> > > > > >>>    - Will this approach work for your scenarios?
> > > >> > >> > > > > >>>    - Do we really need to handle re-authentication
> > > >> > differently
> > > >> > >> > from
> > > >> > >> > > > > >>>    authentication?
> > > >> > >> > > > > >>>
> > > >> > >> > > > > >>>
> > > >> > >> > > > > >>> On Thu, Sep 6, 2018 at 1:40 PM, Ron Dagostino <
> > > >> > >> [email protected]
> > > >> > >> > >
> > > >> > >> > > > > wrote:
> > > >> > >> > > > > >>>
> > > >> > >> > > > > >>> > Yeah, regarding ControllerChannelManager, it is
> one
> > > of
> > > >> the
> > > >> > >> > > > > synchronous
> > > >> > >> > > > > >>> I/O
> > > >> > >> > > > > >>> > use cases (along with 2 others: KafkaProducer,
> via
> > > >> Sender;
> > > >> > >> and
> > > >> > >> > > > > >>> > ReplicaFetcherBlockingSend, via
> > ReplicaFetcherThread)
> > > >> > where
> > > >> > >> the
> > > >> > >> > > > > >>> assumption
> > > >> > >> > > > > >>> > is complete ownership of the connection. The PR's
> > > >> approach
> > > >> > >> to
> > > >> > >> > > > dealing
> > > >> > >> > > > > >>> with
> > > >> > >> > > > > >>> > that assumption is to inject the
> re-authentication
> > > >> > requests
> > > >> > >> > into
> > > >> > >> > > > the
> > > >> > >> > > > > >>> > owner's existing flow so that they are simply
> sent
> > > >> with a
> > > >> > >> > > callback
> > > >> > >> > > > > that
> > > >> > >> > > > > >>> > NetworkClient executes.  The alternative
> approach,
> > > >> which
> > > >> > is
> > > >> > >> > what
> > > >> > >> > > I
> > > >> > >> > > > > >>> think
> > > >> > >> > > > > >>> > you are investigating, is to allow the connection
> > to
> > > be
> > > >> > >> > > temporarily
> > > >> > >> > > > > >>> taken
> > > >> > >> > > > > >>> > offline and having any attempt by
> > > >> ControllerChannelManager
> > > >> > >> (or
> > > >> > >> > > > other
> > > >> > >> > > > > >>> > synchronous use case owner) to use the connection
> > > >> while it
> > > >> > >> is
> > > >> > >> > in
> > > >> > >> > > > this
> > > >> > >> > > > > >>> > "offline" state result in some kind of pause
> until
> > > the
> > > >> > >> > connection
> > > >> > >> > > > > comes
> > > >> > >> > > > > >>> > back online.  One issue with this approach might
> be
> > > the
> > > >> > >> length
> > > >> > >> > of
> > > >> > >> > > > > time
> > > >> > >> > > > > >>> that
> > > >> > >> > > > > >>> > the connection is unavailable; will it be offline
> > for
> > > >> all
> > > >> > >> > > > > >>> authentication
> > > >> > >> > > > > >>> > requests and responses
> > (ApiVersionsRequest/Response,
> > > >> > >> > > > > >>> > SaslHandshakeRequest/Response, and
> > > >> > SaslAuthenticateRequest/
> > > >> > >> > > > > Response)?
> > > >> > >> > > > > >>> > Note
> > > >> > >> > > > > >>> > the last one could actually be invoked multiple
> > > times,
> > > >> so
> > > >> > >> there
> > > >> > >> > > > could
> > > >> > >> > > > > >>> be 4
> > > >> > >> > > > > >>> > or more round-trips before the authentication
> > "dance"
> > > >> is
> > > >> > >> > > finished.
> > > >> > >> > > > > >>> Will
> > > >> > >> > > > > >>> > the connection be "offline" the entire time, or
> > will
> > > >> it be
> > > >> > >> > placed
> > > >> > >> > > > > back
> > > >> > >> > > > > >>> > "online" in between each request/response pair to
> > > allow
> > > >> > the
> > > >> > >> > owner
> > > >> > >> > > > of
> > > >> > >> > > > > >>> the
> > > >> > >> > > > > >>> > connection to use it -- in which case the
> > > >> authentication
> > > >> > >> > process
> > > >> > >> > > > > would
> > > >> > >> > > > > >>> have
> > > >> > >> > > > > >>> > to wait to get ownership again?  The approach I
> > took
> > > >> > >> > interleaves
> > > >> > >> > > > the
> > > >> > >> > > > > >>> > authentication requests/responses with whatever
> the
> > > >> owner
> > > >> > is
> > > >> > >> > > doing,
> > > >> > >> > > > > so
> > > >> > >> > > > > >>> it
> > > >> > >> > > > > >>> > is conceivable that use of the connection jumps
> > back
> > > >> and
> > > >> > >> forth
> > > >> > >> > > > > between
> > > >> > >> > > > > >>> the
> > > >> > >> > > > > >>> > two purposes.  Such jumping back and forth
> > minimizes
> > > >> any
> > > >> > >> added
> > > >> > >> > > > > latency
> > > >> > >> > > > > >>> due
> > > >> > >> > > > > >>> > to the re-authentication, of course.
> > > >> > >> > > > > >>> >
> > > >> > >> > > > > >>> > Anyway, I'll look forward to finding out what you
> > are
> > > >> able
> > > >> > >> to
> > > >> > >> > > > > conclude.
> > > >> > >> > > > > >>> > Good luck :-)
> > > >> > >> > > > > >>> >
> > > >> > >> > > > > >>> > Ron
> > > >> > >> > > > > >>> >
> > > >> > >> > > > > >>> > On Thu, Sep 6, 2018 at 5:17 AM Rajini Sivaram <
> > > >> > >> > > > > [email protected]
> > > >> > >> > > > > >>> >
> > > >> > >> > > > > >>> > wrote:
> > > >> > >> > > > > >>> >
> > > >> > >> > > > > >>> > > Hi Ron,
> > > >> > >> > > > > >>> > >
> > > >> > >> > > > > >>> > > Disconnections on the broker-side: I think we
> > > should
> > > >> do
> > > >> > >> > > > > >>> disconnections
> > > >> > >> > > > > >>> > as a
> > > >> > >> > > > > >>> > > separate KIP and PR as you originally intended.
> > But
> > > >> that
> > > >> > >> one
> > > >> > >> > > > could
> > > >> > >> > > > > be
> > > >> > >> > > > > >>> > done
> > > >> > >> > > > > >>> > > separately without requiring KIP-368 as a
> > pre-req.
> > > >> As a
> > > >> > >> > simpler
> > > >> > >> > > > > >>> > > implementation and one that can be used without
> > > >> KIP-368
> > > >> > in
> > > >> > >> > some
> > > >> > >> > > > > >>> cases, we
> > > >> > >> > > > > >>> > > could commit that first since this one may take
> > > >> longer.
> > > >> > >> > > > > >>> > >
> > > >> > >> > > > > >>> > > I wasn't suggesting that we do move
> > > >> re-authentication to
> > > >> > >> > > > > >>> NetworkClient, I
> > > >> > >> > > > > >>> > > was thinking of the lower layers, handling
> > > >> > authentication
> > > >> > >> and
> > > >> > >> > > > > >>> > > reauthentication at the same layer in a similar
> > > way.
> > > >> Let
> > > >> > >> me
> > > >> > >> > > look
> > > >> > >> > > > > >>> into the
> > > >> > >> > > > > >>> > > code and come up with a more detailed
> explanation
> > > to
> > > >> > avoid
> > > >> > >> > > > > confusion.
> > > >> > >> > > > > >>> > >
> > > >> > >> > > > > >>> > > I am not too concerned about the imports in
> > > >> > KafkaChannel.
> > > >> > >> As
> > > >> > >> > > you
> > > >> > >> > > > > >>> said, we
> > > >> > >> > > > > >>> > > can improve that if we need to. KafkaChannels
> are
> > > >> aware
> > > >> > of
> > > >> > >> > > > > >>> > > network/authentication states and if that
> > becomes a
> > > >> bit
> > > >> > >> more
> > > >> > >> > > > > >>> complex, I
> > > >> > >> > > > > >>> > > don't think it would matter so much. My concern
> > is
> > > >> about
> > > >> > >> > > changes
> > > >> > >> > > > > like
> > > >> > >> > > > > >>> > >
> > > >> > >> > > > > >>> > >
> > > >> https://github.com/apache/kafka/pull/5582/files#diff-
> > > >> > >> > > > > >>> > 987fef43991384a3ebec5fb55e53b577
> > > >> > >> > > > > >>> > > in ControllerChannelManager. Those classes
> > > shouldn't
> > > >> > have
> > > >> > >> > deal
> > > >> > >> > > > with
> > > >> > >> > > > > >>> SASL
> > > >> > >> > > > > >>> > or
> > > >> > >> > > > > >>> > > reauthentication. Anyway, I will get back with
> > more
> > > >> > >> detail on
> > > >> > >> > > > what
> > > >> > >> > > > > I
> > > >> > >> > > > > >>> had
> > > >> > >> > > > > >>> > in
> > > >> > >> > > > > >>> > > mind since that may not be viable at all.
> > > >> > >> > > > > >>> > >
> > > >> > >> > > > > >>> > >
> > > >> > >> > > > > >>> > >
> > > >> > >> > > > > >>> > > On Thu, Sep 6, 2018 at 1:44 AM, Ron Dagostino <
> > > >> > >> > > [email protected]
> > > >> > >> > > > >
> > > >> > >> > > > > >>> wrote:
> > > >> > >> > > > > >>> > >
> > > >> > >> > > > > >>> > > > I just thought of another alternative if the
> > > >> imports
> > > >> > are
> > > >> > >> > the
> > > >> > >> > > > > >>> concern.
> > > >> > >> > > > > >>> > > > KafkaChannel could expose the fact that it
> can
> > > >> create
> > > >> > an
> > > >> > >> > > > > additional
> > > >> > >> > > > > >>> > > > Authenticator instance on the side (what I
> > > >> referred to
> > > >> > >> as
> > > >> > >> > > > > >>> > > > notYetAuthenticatedAuthenticator in the PR)
> and
> > > it
> > > >> > could
> > > >> > >> > let
> > > >> > >> > > > > >>> > > > kafka.server.KafkaApis drive the whole thing
> --
> > > >> create
> > > >> > >> the
> > > >> > >> > > > > >>> instance on
> > > >> > >> > > > > >>> > > the
> > > >> > >> > > > > >>> > > > side, clean it up if it fails, move it into
> > place
> > > >> and
> > > >> > >> close
> > > >> > >> > > the
> > > >> > >> > > > > >>> old one
> > > >> > >> > > > > >>> > > if
> > > >> > >> > > > > >>> > > > it succeeds, etc.  Then Kaf
>

Re: [DISCUSS] KIP 368: Allow SASL Connections to Periodically Re-Authenticate

Reply via email to