Re: [DISCUSS] KIP 368: Allow SASL Connections to Periodically Re-Authenticate

Ron Dagostino Wed, 12 Sep 2018 13:49:28 -0700

Ok, I am tempted to just say we go with the low-level approach since it is
the quickest and seems to meet the clear requirements. We can always adjust
later if we get to clarity on other requirements or we decide we need to
approach it differently for whatever reason.  But in the meantime, before
fully committing to this decision, I would appreciate another perspective
if someone has one.


Ron

On Wed, Sep 12, 2018 at 3:15 PM Rajini Sivaram <[email protected]>
wrote:

> Hi Ron,
>
> Yes, I would leave out retries from this KIP for now. In the future, if
> there is a requirement for supporting retries, we can consider it. I think
> we can support retries with either approach if we needed to, but it would
> be better to do it along with other changes required to support
> authentication servers that are not highly available.
>
> For maintainability, I am biased, so it will be good to get the perspective
> of others in the community :-)
>
> On Wed, Sep 12, 2018 at 7:47 PM, Ron Dagostino <[email protected]> wrote:
>
> > Hi Rajini.  Here is some feedback/some things I thought of.
> >
> > First, I just realized that from a timing perspective that I am not sure
> > retry is going to be necessary.  The background login refresh thread
> > triggers re-authentication when it refreshes the client's credential.
> The
> > OAuth infrastructure has to be available in order for this refresh to
> > succeed (the background thread repeatedly retries if it can't refresh the
> > credential, and that retry loop handles any temporary outage).  If
> clients
> > are told to re-authenticate when the credential is refreshed **and they
> > actually re-authenticate at that point** then it is highly unlikely that
> > the OAuth infrastructure would fail within those intervening
> milliseconds.
> > So we don't need re-authentication retry in this KIP as long as retry
> > starts immediately.  The low-level prototype as currently coded doesn't
> > actually start re-authentication until the connection is subsequently
> used,
> > and that could take a while.  But then again, if the connection isn't
> used
> > for some period of time, then the lost value of having to abandon the
> > connection is lessened anyway.  Plus, as you pointed out, OAuth
> > infrastructure is assumed to be highly available.
> >
> > Does this makes sense, and would you be willing to say that retry isn't a
> > necessary requirement?  I'm tempted but would like to hear your
> perspective
> > first.
> >
> > Interleaving/latency and maintainability (more than lines of code) are
> the
> > two remaining areas of comparison.  I did not add the maintainability
> issue
> > to the KIP yet, but before I do I thought I would address it here first
> to
> > see if we can come to consensus on it because I'm not sure I see the
> > high-level approach as being hard to maintain (yet -- I could be
> > convinced/convince myself; we'll see).  I do want to make sure we are on
> > the same page about what is required to add re-authentication support in
> > the high-level case.  Granted, the amount is zero in the low-level case,
> > and it doesn't get any better that that, but the amount in the high-level
> > case is very low -- just a few lines of code.  For example:
> >
> > KafkaAdminClient:
> > https://github.com/apache/kafka/pull/5582/commits/4fa70f38b9
> > d33428ff98b64a3a2bd668f5f28c38#diff-6869b8fccf6b098cbcb0676e8ceb26a7
> > It is the same few lines of code for KafkaConsumer, KafkaProducer,
> > WorkerGroupMember, and TransactionMarkerChannelManager
> >
> > The two synchronous I/O use cases are ControllerChannelManager and
> > ReplicaFetcherBlockingSend (via ReplicaFetcherThread), and they require a
> > little bit more -- but not much.
> >
> > Thoughts?
> >
> > Ron
> >
> > On Wed, Sep 12, 2018 at 1:57 PM Ron Dagostino <[email protected]> wrote:
> >
> > > Thanks, Rajini.  Before I digest/respond to that, here's an update
> that I
> > > just completed.
> > >
> > > I added a commit to the PR (https://github.com/apache/kafka/pull/5582/
> )
> > > that implements server-side kill of expired OAUTHBEARER connections.
> No
> > > tests yet since we still haven't settled on a final approach (low-level
> > vs.
> > > high-level).
> > >
> > > I also updated the KIP to reflect the latest discussion and PR as
> > follows:
> > >
> > >    - Include support for brokers killing connections as part of this
> KIP
> > >    (rather than deferring it to a future KIP as was originally
> > mentioned; the
> > >    PR now includes it as mentioned -- it was very easy to add)
> > >    - Added metrics (they will mirror existing ones related to
> > >    authentications; I have not added those to the PR)
> > >    - Updated the implementation description to reflect the current
> state
> > >    of the PR, which is a high-level, one-size-fits-all approach (as
> > opposed to
> > >    my initial, even-higher-level approach)
> > >    - Added a "Rejected Alternative" for the first version of the PR,
> > >    which injected requests directly into synchronous I/O clients'
> queues
> > >    - Added a "Rejected Alternative" for the low-level approach as
> > >    suggested, but of course we have not formally decided to reject this
> > >    approach or adopt the current PR implementation.
> > >
> > > I'll think about where we stand some more before responding again.
> > Thanks
> > > for the above reply.
> > >
> > > Ron
> > >
> > > On Wed, Sep 12, 2018 at 1:36 PM Rajini Sivaram <
> [email protected]>
> > > wrote:
> > >
> > >> Hi Ron,
> > >>
> > >> Thank you for summarising, I think it covers the differences between
> the
> > >> two approaches well.
> > >>
> > >> A few minor points to answer the questions in there:
> > >>
> > >> 1) When re-authetication is initiated in the Selector during poll(),
> we
> > >> can
> > >> move an idle channel to re-authentication state. It is similar to
> > >> injecting
> > >> requests, but achieved by changing channel back to authenticating
> state.
> > >>
> > >> 3) To clarify why I think re-authentication should fit in with our
> > >> authentication design: My point was not about a specific connection
> > being
> > >> usable or not usable. It was about what happens at the client API
> level.
> > >> Our client API (producer/consumer/admin client etc.) currently assume
> > that
> > >> a single broker authentication failure is a fatal error that is never
> > >> retried because we assume that broker only ever fails an
> authentication
> > >> request if credentials are invalid. If we ever decide to support cases
> > >> where broker occasionally fails an authentication request due to a
> > >> transient failure, we need to do more around how we handle
> > authentication
> > >> failures in clients. We may decide that it is ok to close the
> connection
> > >> for authentication and not for re-authentication as you mentioned, but
> > we
> > >> need to change the way this disconnection is handled by clients. So
> IMO,
> > >> we
> > >> should either add support for transient retriable authentication
> > failures
> > >> properly or not retry for any scenario. Personally, I don't think we
> > would
> > >> want to retry all authentication failures even if it is a
> > >> re-authentication, I think we could (at some point in future), allow
> > >> brokers to return an error code that indicates that it is a transient
> > >> broker-side failure rather than invalid credentials and handle the
> error
> > >> differently. I see no reason at that point why we wouldn't handle
> > >> authentication and re-authentication in the same way.
> > >>
> > >> 4) As you said, the high-level approach would be bigger than the
> > low-level
> > >> approach in terms of LOC. But I wouldn't be too concerned about lines
> of
> > >> code. My bigger concern was about modularity. Our security code is
> > already
> > >> complex, protocols like Kerberos and SSL that we use from the JRE make
> > >> problem diagnosis hard. Async I/O makes the networking code complex.
> You
> > >> need to understand networking layer to work with the security layer,
> but
> > >> the rest of the code base doesn't rely on knowledge of
> network/security
> > >> layers. My main concern about the high-level approach is that it spans
> > >> these boundaries, making it harder to maintain in the long run.
> > >>
> > >>
> > >> On Wed, Sep 12, 2018 at 10:23 AM, Stanislav Kozlovski <
> > >> [email protected]> wrote:
> > >>
> > >> > Hi Ron, Rajini
> > >> >
> > >> > Thanks for summarizing the discussion so far, Ron!
> > >> >
> > >> > 1) How often do we have such long-lived connection idleness (e.g
> 5-10
> > >> > minutes) in practice?
> > >> >
> > >> > 3) I agree that retries for re-authentication are useful.
> > >> >
> > >> > 4) The interleaving of requests sounds like a great feature to have,
> > but
> > >> > the tradeoff against code complexity is a tough one. I would
> > personally
> > >> go
> > >> > with the simpler approach since you could always add interleaving on
> > >> top if
> > >> > the community decides the latency should be better.
> > >> >
> > >> > Best,
> > >> > Stanislav
> > >> >
> > >> > On Tue, Sep 11, 2018 at 5:00 AM Ron Dagostino <[email protected]>
> > >> wrote:
> > >> >
> > >> > > Hi everyone.  I've updated the PR to reflect the latest
> conclusions
> > >> from
> > >> > > this ongoing discussion.  The KIP still needs the suggested
> > updates; I
> > >> > will
> > >> > > do that later this week.  II agree with Rajini that some
> additional
> > >> > > feedback from the community at large would be very helpful at this
> > >> point
> > >> > in
> > >> > > time.
> > >> > >
> > >> > > Here's where we stand.
> > >> > >
> > >> > > We have 2 separate implementations for re-authenticating existing
> > >> > > connections -- a so-called "low-level" approach and a (relatively
> > >> > speaking)
> > >> > > "high-level" approach -- and we agree on how they should be
> > compared.
> > >> > > Specifically, Rajini has provided a mechanism that works at a
> > >> relatively
> > >> > > low level in the stack by intercepting network sends and queueing
> > >> them up
> > >> > > while re-authentication happens; then the queued sends are sent
> > after
> > >> > > re-authentication succeeds, or the connection is closed if
> > >> > > re-authentication fails.  See the prototype commit at
> > >> > >
> > >> > > https://github.com/rajinisivaram/kafka/commit/b9d711907ad843
> > >> > c11d17e80d6743bfb1d4e3f3fd
> > >> > > .
> > >> > >
> > >> > > I have an implementation that works higher up in the stack; it
> > injects
> > >> > > authentication requests into the KafkaClient via a new method
> added
> > to
> > >> > that
> > >> > > interface, and the implementation (i.e. NetworkClient) sends them
> > when
> > >> > > poll() is called.  The updated PR is available at
> > >> > > https://github.com/apache/kafka/pull/5582/.
> > >> > >
> > >> > > Here is how we think these two approaches should be compared.
> > >> > >
> > >> > > 1) *When re-authentication starts*.  The low-level approach
> > initiates
> > >> > > re-authentication only if/when the connection is actually used, so
> > it
> > >> may
> > >> > > start after the existing credential expires; the current PR
> > >> > implementation
> > >> > > injects re-authentication requests into the existing flow, and
> > >> > > re-authentication starts immediately regardless of whether or not
> > the
> > >> > > connection is being used for something else.  Rajini believes the
> > >> > low-level
> > >> > > approach can be adjusted to re-authenticate idle connections
> > (Rajini,
> > >> I
> > >> > > don't see how this can be done without injecting requests, which
> is
> > >> what
> > >> > > the high-level connection is doing, but I am probably missing
> > >> something
> > >> > --
> > >> > > no need to go into it unless it is easy to explain.)
> > >> > >
> > >> > > 2) *When requests not related to re-authentication are able to use
> > the
> > >> > > re-authenticating connection*.  The low-level approach finishes
> > >> > > re-authentication completely before allowing anything else to
> > traverse
> > >> > the
> > >> > > connection, and we agree that this is how the low-level
> > implementation
> > >> > must
> > >> > > work without significant work/changes.  The current PR
> > implementation
> > >> > > interleaves re-authentication requests with the existing flow in
> the
> > >> > > asynchronous I/O uses cases (Consumer, Producer, Admin Client,
> > etc.),
> > >> and
> > >> > > it allows the two synchronous use cases (ControllerChannelManager
> > and
> > >> > > ReplicaFetcherBlockingSend/ReplicaFetcherThread) to decide which
> > style
> > >> > they
> > >> > > want.  I (somewhat arbitrarily, to prove out the flexibility of
> the
> > >> > > high-level approach) coded ControllerChannelManager to do
> complete,
> > >> > > non-interleaved re-authentication and ReplicaFetcherBlockingSend/
> > >> > > ReplicaFetcherThread to interleave requests.  The approach has
> > >> impacts on
> > >> > > the maximum size of latency spikes: interleaving requests can
> > decrease
> > >> > the
> > >> > > latency.  The benefit of interleaving is tough to evaluate because
> > it
> > >> > isn't
> > >> > > clear what the latency requirement really is.  Note that
> > >> > re-authentication
> > >> > > can entail several network round-trips between the client and the
> > >> broker.
> > >> > > Comments in this area would be especially appreciated.
> > >> > >
> > >> > > 3) *What happens if re-authentication fails (i.e. retry
> > >> capability)*.  I
> > >> > > think this is where we have not yet settled on what the
> requirement
> > is
> > >> > > (even more so than the issue of potential latency mitigation
> > >> > > requirements).  Rajini, you had mentioned that re-authentication
> > >> should
> > >> > > work the same way as authentication, but I see the two situations
> as
> > >> > being
> > >> > > asymmetric.  In the authentication case, if authentication fails,
> > the
> > >> > > connection was never fully established and could not be used, and
> it
> > >> is
> > >> > > closed -- TLS negotiation may have taken place, so that effort is
> > >> lost,
> > >> > but
> > >> > > beyond that nothing else is lost.  In the re-authentication case
> the
> > >> > > connection is fully established, it is in use, and in fact it can
> > >> remain
> > >> > in
> > >> > > use for some time going forward as long as the existing credential
> > >> > remains
> > >> > > unexpired; abandoning it at this point due to a re-authentication
> > >> failure
> > >> > > (which can occur due to no fault of the client -- i.e. a remote
> > >> directory
> > >> > > or OAuth server being temporarily down) abandons a fully
> functioning
> > >> > > connection with remaining lifetime.  Am I thinking about this
> > >> reasonably?
> > >> > > I still tend to believe that retries of failed re-authentications
> > are
> > >> a
> > >> > > requirement.
> > >> > >
> > >> > > 4) *Size*. The low-level approach is smaller in terms of code.  It
> > is
> > >> a
> > >> > > prototype, so it doesn't have everything the high-level approach
> > has,
> > >> > but I
> > >> > > have split the PR for the high-level approach into two separate
> > >> commits
> > >> > to
> > >> > > facilitate a more apples-to-apples comparison: the first commit
> > >> contains
> > >> > > infrastructure that would have to be added to the low-level
> > prototype
> > >> if
> > >> > we
> > >> > > went with that, and the second commit has everything specific to
> the
> > >> > > high-level approach.  So comparing the low-level prototype commit
> > >> (~350
> > >> > > changes) to the second commit in the high-level PR (~1400 changes)
> > is
> > >> > > reasonable. The high-level approach at this point is about 4 times
> > as
> > >> > big,
> > >> > > though it does have lots of error/sanity checking and
> functionality
> > >> (such
> > >> > > as retry) that would to at least some degree have to be added to
> the
> > >> > > low-level implementation.  Still, the high-level approach is (and
> > will
> > >> > > likely remain) somewhat bigger than the low-level approach.
> > >> > >
> > >> > > I hope I presented this in a balanced and accurate way; Rajini,
> > please
> > >> > > correct me if I got anything wrong or left out anything important.
> > >> > > Apologies in advance if that is the case -- it is unintended.
> > >> > >
> > >> > > Again, comments/discussion from the wider community at this
> juncture
> > >> > would
> > >> > > be especially appreciated.
> > >> > >
> > >> > > Ron
> > >> > >
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Mon, Sep 10, 2018 at 7:42 AM Rajini Sivaram <
> > >> [email protected]>
> > >> > > wrote:
> > >> > >
> > >> > > > Hi Ron,
> > >> > > >
> > >> > > > Thank you for the analysis. Yes, I agree with the differences
> you
> > >> have
> > >> > > > pointed out between the two approaches.
> > >> > > >
> > >> > > >    1. Re-authenticating idle connections (or connections nearing
> > >> idle
> > >> > > >    timeout): With the lower-level approach, there is nothing
> > >> stopping
> > >> > us
> > >> > > > from
> > >> > > >    re-authenticating idle connections. We could either consider
> > idle
> > >> > > >    connections for re-authentication or we could implement code
> in
> > >> > broker
> > >> > > > to
> > >> > > >    kill connections that deals with delays introduced by
> idleness.
> > >> > Since
> > >> > > we
> > >> > > >    expect token lifetimes to be much higher than connection
> expiry
> > >> > time,
> > >> > > >    either would work. And the second approach is probably
> better.
> > >> > > >    2. Interleaving authentication requests and application
> > requests
> > >> to
> > >> > > >    reduce impact on latency: It is not impossible to interleave
> > with
> > >> > the
> > >> > > >    low-level approach, but it adds complexity and reduces the
> > value
> > >> of
> > >> > > this
> > >> > > >    approach. So it will be good to understand the requirements
> in
> > >> terms
> > >> > > of
> > >> > > >    latency. In terms of comparing the two approaches, it is fair
> > to
> > >> > > compare
> > >> > > >    non-interleaved low-level with the naturally interleaved
> > >> high-level
> > >> > > >    approach.
> > >> > > >    3. Retries: I don't think requirement for retries is
> different
> > >> > between
> > >> > > >    authentication and re-authentication from a client's point of
> > >> view.
> > >> > We
> > >> > > >    should either support retries for both or not at all. We
> > >> currently
> > >> > > > process
> > >> > > >    authentication failures as fatal errors. We expect that if
> you
> > >> are
> > >> > > using
> > >> > > >    external authentication servers, those servers will be highly
> > >> > > available.
> > >> > > >    Our built-in authentication mechanisms avoid accessing
> external
> > >> > > servers
> > >> > > > in
> > >> > > >    the authentication path (during authentication, verification
> is
> > >> > > > performed
> > >> > > >    using private/public keys or cache lookup). This is necessary
> > in
> > >> our
> > >> > > > design
> > >> > > >    since authentications are performed synchronously on the
> > network
> > >> > > thread.
> > >> > > >
> > >> > > > One more point with the lower-level mechanism: If one day we
> > decided
> > >> > that
> > >> > > > we want to extend this for SSL, it would be possible to change
> > just
> > >> the
> > >> > > > transport layer and make it work for SSL. We may never want to
> do
> > >> this,
> > >> > > but
> > >> > > > thought I would point out anyway.
> > >> > > >
> > >> > > > I think the next step would be to add the low-level approach to
> > the
> > >> KIP
> > >> > > > under `Rejected Alternatives` with details on the differences.
> And
> > >> more
> > >> > > > detail on requirements covering latency and retries in the body
> of
> > >> the
> > >> > > > KIP.  For "Rejected Alternatives", can we add sub-sections (I
> > think
> > >> > there
> > >> > > > are two approaches there now, but you need to read through to
> > figure
> > >> > that
> > >> > > > out, so sub-titles would help).
> > >> > > >
> > >> > > > Once the KIP is updated, it will be good to get more feedback
> from
> > >> the
> > >> > > > community to decide on the approach to adopt.
> > >> > > >
> > >> > > >
> > >> > > > On Sat, Sep 8, 2018 at 1:27 PM, Ron Dagostino <
> [email protected]>
> > >> > wrote:
> > >> > > >
> > >> > > > > Hi yet again, Rajini.  If we decide that interleaving is
> > >> impossible
> > >> > > with
> > >> > > > > the low-level approach but we still want to at least support
> the
> > >> > > > > possibility given that it reduces the size of latency spikes,
> > >> then I
> > >> > do
> > >> > > > > have a suggestion.  I documented in the KIP how I rejected a
> > >> single,
> > >> > > > > one-size-fits-all queue approach because we don't know when
> > poll()
> > >> > will
> > >> > > > be
> > >> > > > > called in the synchronous I/O use case.  An advantage of such
> an
> > >> > > approach
> > >> > > > > -- if it would have worked, which it didn't -- is that it
> would
> > >> have
> > >> > > been
> > >> > > > > transparent to the owner of the NetworkClient instance (this
> is
> > >> one
> > >> > of
> > >> > > > the
> > >> > > > > big advantages of the low-level approach, after all).  But we
> > >> could
> > >> > > make
> > >> > > > > the one-size-fits-all queue approach work if we are willing to
> > >> impose
> > >> > > > some
> > >> > > > > requirements on synchronous I/O users of NetworkClient.
> > >> > Specifically,
> > >> > > we
> > >> > > > > could add a method to the org.apache.kafka.clients.KafkaClient
> > >> > > interface
> > >> > > > > (which NetworkClient implements) as follows:
> > >> > > > >
> > >> > > > >     /**
> > >> > > > >
> > >> > > > >      * Return true if any node has re-authentication requests
> > >> waiting
> > >> > > to
> > >> > > > be
> > >> > > > > sent. A
> > >> > > > >
> > >> > > > >      * call to {@link #poll(long, long)} is required to send
> > such
> > >> > > > requests.
> > >> > > > > An owner
> > >> > > > >
> > >> > > > >      * of this instance that does not implement a run loop to
> > >> > > repeatedly
> > >> > > > > call
> > >> > > > >
> > >> > > > >      * {@link #poll(long, long)} but instead only sends
> requests
> > >> > > > > synchronously
> > >> > > > >
> > >> > > > >      * on-demand must call this method periodically -- and
> > invoke
> > >> > > > >
> > >> > > > >      * {@link #poll(long, long)} if the return value is {@code
> > >> true}
> > >> > --
> > >> > > > to
> > >> > > > > ensure
> > >> > > > >
> > >> > > > >      * that any re-authentication requests that have ben
> > injected
> > >> are
> > >> > > > sent
> > >> > > > > in a
> > >> > > > >
> > >> > > > >      * timely fashion.
> > >> > > > >
> > >> > > > >      *
> > >> > > > >
> > >> > > > >      * @return true if any node has re-authentication requests
> > >> > waiting
> > >> > > to
> > >> > > > > be sent
> > >> > > > >
> > >> > > > >      */
> > >> > > > >
> > >> > > > >     default boolean hasReauthenticationRequests() {
> > >> > > > >
> > >> > > > >         return false;
> > >> > > > >
> > >> > > > >     }
> > >> > > > >
> > >> > > > > If we are willing to add an additional method like this and
> make
> > >> the
> > >> > > > minor
> > >> > > > > adjustments to the code associated with the few synchronous
> use
> > >> cases
> > >> > > > then
> > >> > > > > I think the high-level approach will work.  We would then have
> > the
> > >> > > > > possibility of further parameterizing the various synchronous
> > I/O
> > >> use
> > >> > > > > cases.  For example, there are currently 3:
> > >> > > > >
> > >> > > > > ControllerChannelManager
> > >> > > > > KafkaProducer, via Sender
> > >> > > > > ReplicaFetcherBlockingSend, via ReplicaFetcherThread
> > >> > > > >
> > >> > > > > Is it conceivable that one of these use cases might be more
> > >> > > > > latency-sensitive than others and might desire interleaving?
> A
> > >> > > > high-level
> > >> > > > > approach would give us the option of configuring each use case
> > >> > > > accordingly.
> > >> > > > >
> > >> > > > > Ron
> > >> > > > >
> > >> > > > > On Fri, Sep 7, 2018 at 10:50 PM Ron Dagostino <
> > [email protected]>
> > >> > > wrote:
> > >> > > > >
> > >> > > > > > Hi again, Rajini.  It occurs to me that from a *behavior*
> > >> > perspective
> > >> > > > > > there are really 3 fundamental differences between the
> > low-level
> > >> > > > approach
> > >> > > > > > you provided and the high-level approach as it currently
> > exists
> > >> in
> > >> > > the
> > >> > > > > PR:
> > >> > > > > >
> > >> > > > > > 1) *When re-authentication starts*.  The low-level approach
> > >> > initiates
> > >> > > > > > re-authentication only if/when the connection is actually
> > used,
> > >> so
> > >> > it
> > >> > > > may
> > >> > > > > > start after the existing credential expires; the current PR
> > >> > > > > implementation
> > >> > > > > > injects re-authentication requests into the existing flow,
> and
> > >> > > > > > re-authentication starts immediately regardless of whether
> or
> > >> not
> > >> > the
> > >> > > > > > connection is being used for something else.
> > >> > > > > >
> > >> > > > > > 2) *When requests not related to re-authentication can use
> the
> > >> > > > > > re-authenticating connection*.  The low-level approach
> > finishes
> > >> > > > > > re-authentication completely before allowing anything else
> to
> > >> > travers
> > >> > > > the
> > >> > > > > > connection; the current PR implementation interleaves
> > >> > > re-authentication
> > >> > > > > > requests with existing flow requests.
> > >> > > > > >
> > >> > > > > > 3) *What happens if re-authentication fails*.  The low-level
> > >> > approach
> > >> > > > > > results in a closed connection and does not support retries
> --
> > >> at
> > >> > > least
> > >> > > > > as
> > >> > > > > > currently implemented; the current PR implementation
> supports
> > >> > > retries.
> > >> > > > > >
> > >> > > > > > Do you agree that these are the (only?) behavioral
> > differences?
> > >> > > > > >
> > >> > > > > > For these facets of behavior, I wonder what the requirements
> > are
> > >> > for
> > >> > > > this
> > >> > > > > > feature.  I think they are as follows:
> > >> > > > > >
> > >> > > > > > 1) *When re-authentication starts*: I don't think we have a
> > hard
> > >> > > > > > requirement here when considered in isolation -- whether
> > >> > > > > re-authentication
> > >> > > > > > starts immediately or is delayed until the connection is
> used
> > >> > > probably
> > >> > > > > > doesn't matter.
> > >> > > > > >
> > >> > > > > > 2) *When requests not related to re-authentication can use
> the
> > >> > > > > > re-authenticating connection*: there is a tradeoff here
> > between
> > >> > > latency
> > >> > > > > > and ability to debug re-authentication problems.  Delaying
> use
> > >> of
> > >> > the
> > >> > > > > > connection until re-authentication finishes results in
> bigger
> > >> > latency
> > >> > > > > > spikes but keeps re-authentication requests somewhat
> together;
> > >> > > > > interleaving
> > >> > > > > > minimizes the size of individual latency spikes but adds
> some
> > >> > > > separation
> > >> > > > > > between the requests.
> > >> > > > > >
> > >> > > > > > 3) *What happens if re-authentication fails*: I believe we
> > have
> > >> a
> > >> > > clear
> > >> > > > > > requirement for retry capability given what I have written
> > >> > > previously.
> > >> > > > > Do
> > >> > > > > > you agree?  Note that while the current low-level
> > implementation
> > >> > does
> > >> > > > not
> > >> > > > > > support retry, I have been thinking about how that can be
> > done,
> > >> > and I
> > >> > > > am
> > >> > > > > > pretty sure it can be.  We can keep the old authenticators
> on
> > >> the
> > >> > > > client
> > >> > > > > > and server sides and put them back into place if the
> > >> > > re-authentication
> > >> > > > > > fails; we would also have to make sure the server side
> doesn't
> > >> > delay
> > >> > > > any
> > >> > > > > > failed re-authentication result and also doesn't close the
> > >> > connection
> > >> > > > > upon
> > >> > > > > > re-authentication failure.  I think I see how to do all of
> > that.
> > >> > > > > >
> > >> > > > > > There are some interactions between the above requirements.
> > If
> > >> > > > > > re-authentication can't start immediately and has to wait
> for
> > >> the
> > >> > > > > > connection to be used then that precludes interleaving
> because
> > >> we
> > >> > > can't
> > >> > > > > be
> > >> > > > > > sure that the credential will be active by the time it is
> used
> > >> --
> > >> > if
> > >> > > it
> > >> > > > > > isn't active, and we allow interleaving, then requests not
> > >> related
> > >> > to
> > >> > > > > > re-authentication will fail if server-side
> > >> > > > > > connection-close-due-to-expired-credential functionality is
> > in
> > >> > place.
> > >> > > > > Also
> > >> > > > > > note that any such server-side connection-close-due-to-
> > >> > > > > expired-credential
> > >> > > > > > functionality would likely have to avoid closing a
> connection
> > >> until
> > >> > > it
> > >> > > > is
> > >> > > > > > used for anything other than re-authentication -- it must
> > allow
> > >> > > > > > re-authentication requests through when the credential is
> > >> expired.
> > >> > > > > >
> > >> > > > > > Given all of the above, it feels to me like the low-level
> > >> solution
> > >> > is
> > >> > > > > > viable only under the following conditions:
> > >> > > > > >
> > >> > > > > > 1) We must accept a delay in when re-authentication occurs
> to
> > >> when
> > >> > > the
> > >> > > > > > connection is used
> > >> > > > > > 2) We must accept the bigger latency spikes associated with
> > not
> > >> > > > > > interleaving requests
> > >> > > > > >
> > >> > > > > > Does this sound right to you, and if so, do you find these
> > >> > conditions
> > >> > > > > > acceptable?  Or have I missed something and/or made
> incorrect
> > >> > > > assumptions
> > >> > > > > > somewhere?
> > >> > > > > >
> > >> > > > > > Ron
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Fri, Sep 7, 2018 at 5:19 PM Ron Dagostino <
> > [email protected]
> > >> >
> > >> > > > wrote:
> > >> > > > > >
> > >> > > > > >> Hi Rajini.  The code really helped me to understand what
> you
> > >> were
> > >> > > > > >> thinking -- thanks.  I'm still digesting, but here are some
> > >> quick
> > >> > > > > >> observations:
> > >> > > > > >>
> > >> > > > > >> Your approach (I'll call it the "low-level" approach, as
> > >> compared
> > >> > to
> > >> > > > the
> > >> > > > > >> existing PR, which works at a higher level of abstraction)
> --
> > >> the
> > >> > > > > low-level
> > >> > > > > >> approach is certainly intriguing.  The smaller code size is
> > >> > welcome,
> > >> > > > of
> > >> > > > > >> course, as is the fact that re-authentication simply works
> > for
> > >> > > > everyone
> > >> > > > > >> regardless of the style of use (async vs. sync I/O).
> > >> > > > > >>
> > >> > > > > >> I did notice that re-authentication of the connection
> starts
> > >> only
> > >> > > > > if/when
> > >> > > > > >> the client uses the connection.  For example, if I run a
> > >> console
> > >> > > > > producer,
> > >> > > > > >> re-authentication does not happen unless I try to produce
> > >> > something.
> > >> > > > On
> > >> > > > > >> the one hand this is potentially good -- if the client
> isn't
> > >> using
> > >> > > the
> > >> > > > > >> connection then the connection stays "silent" and could be
> > >> closed
> > >> > on
> > >> > > > the
> > >> > > > > >> server side if it stays idle long enough -- whereas if we
> > start
> > >> > > > > injecting
> > >> > > > > >> re-authentication requests (as is done in the high-level
> > >> approach)
> > >> > > > then
> > >> > > > > the
> > >> > > > > >> connection never goes completely silent and could (?)
> > >> potentially
> > >> > > > avoid
> > >> > > > > >> being closed due to idleness.
> > >> > > > > >>
> > >> > > > > >> However, if we implement sever-side kill of connections
> using
> > >> > > expired
> > >> > > > > >> credentials (which we agree is needed), then we might end
> up
> > >> with
> > >> > > the
> > >> > > > > >> broker killing connections that are sitting idle for only a
> > >> short
> > >> > > > > period of
> > >> > > > > >> time.  For example, if we refresh the token on the client
> > side
> > >> and
> > >> > > > tell
> > >> > > > > the
> > >> > > > > >> connection that it is eligible to be re-authenticated, then
> > it
> > >> is
> > >> > > > > >> conceivable that the connection might be sitting idle at
> that
> > >> > point
> > >> > > > and
> > >> > > > > >> might not be used until after the token it is currently
> using
> > >> > > expires.
> > >> > > > > The
> > >> > > > > >> server might kill the connection, and that would force the
> > >> client
> > >> > to
> > >> > > > > >> re-connect with a new connection (requiring TLS
> negotiation).
> > >> The
> > >> > > > > >> probability of this happening increases as the token
> lifetime
> > >> > > > > decreases, of
> > >> > > > > >> course, and it can be offset by decreasing the window
> factor
> > >> (i.e.
> > >> > > > make
> > >> > > > > it
> > >> > > > > >> eligible for re-authenticate at 50% of the lifetime rather
> > than
> > >> > 80%,
> > >> > > > for
> > >> > > > > >> example -- it would have to sit idle for longer before the
> > >> server
> > >> > > > tried
> > >> > > > > to
> > >> > > > > >> kill it).  We haven't implemented server-side kill yet, so
> > >> maybe
> > >> > we
> > >> > > > > could
> > >> > > > > >> make it intelligent and only kill the connection if it is
> > used
> > >> > (for
> > >> > > > > >> anything except re-authentication) after the expiration
> > time...
> > >> > > > > >>
> > >> > > > > >> I also wonder about the ability to add retry into the
> > low-level
> > >> > > > > >> approach.  Do you think it would be possible?  It doesn't
> > seem
> > >> > like
> > >> > > it
> > >> > > > > to
> > >> > > > > >> me -- at least not without some big changes that would
> > >> eliminate
> > >> > > some
> > >> > > > of
> > >> > > > > >> the advantage of being able to reuse the existing
> > >> authentication
> > >> > > code.
> > >> > > > > The
> > >> > > > > >> reason I ask is because I think retry is necessary.  It is
> > >> part of
> > >> > > how
> > >> > > > > >> refreshes work for both GSSAPI and OAUTHBEARER -- they
> > refresh
> > >> > based
> > >> > > > on
> > >> > > > > >> some window factor (i.e. 80%) and implement periodic retry
> > upon
> > >> > > > failure
> > >> > > > > so
> > >> > > > > >> that they can maximize the chances of having a new
> credential
> > >> > > > available
> > >> > > > > for
> > >> > > > > >> any new connection attempt.  Without refresh we could end
> up
> > in
> > >> > the
> > >> > > > > >> situation where the connection still has some lifetime left
> > >> (10%,
> > >> > > 20%,
> > >> > > > > or
> > >> > > > > >> whatever) but it tries to re-authenticate and cannot
> through
> > no
> > >> > > fault
> > >> > > > of
> > >> > > > > >> its own (i.e. token endpoint down, some Kerberos failure,
> > etc.)
> > >> > -=-
> > >> > > > the
> > >> > > > > >> connection is closed at that point, and it is then unable
> to
> > >> > > reconnect
> > >> > > > > >> because of the same temporary problem.  We could end up
> with
> > an
> > >> > > > > especially
> > >> > > > > >> ill-timed, temporary outage in some non-Kafka system
> (related
> > >> to
> > >> > > OAuth
> > >> > > > > or
> > >> > > > > >> Kerberos, or some LDAP directory) causing all clients to be
> > >> kicked
> > >> > > off
> > >> > > > > the
> > >> > > > > >> cluster.  Retry capability seems to me to be the way to
> > >> mitigate
> > >> > > this
> > >> > > > > risk.
> > >> > > > > >>
> > >> > > > > >> Anyway, that's it for now.  I really like the approach you
> > >> > outlined
> > >> > > --
> > >> > > > > at
> > >> > > > > >> least at this point based on my current understanding.  I
> > will
> > >> > > > continue
> > >> > > > > to
> > >> > > > > >> dig in, and I may send more comments/questions.  But for
> > now, I
> > >> > > think
> > >> > > > > the
> > >> > > > > >> lack of retry -- and my definitely-could-be-wrong sense
> that
> > it
> > >> > > can't
> > >> > > > > >> easily be added -- is my biggest concern with this
> low-level
> > >> > > approach.
> > >> > > > > >>
> > >> > > > > >> Ron
> > >> > > > > >>
> > >> > > > > >> On Thu, Sep 6, 2018 at 4:57 PM Rajini Sivaram <
> > >> > > > [email protected]>
> > >> > > > > >> wrote:
> > >> > > > > >>
> > >> > > > > >>> Hi Ron,
> > >> > > > > >>>
> > >> > > > > >>> The commit
> > >> > > > > >>>
> > >> > > > > >>> https://github.com/rajinisivaram/kafka/commit/
> > >> > > > > b9d711907ad843c11d17e80d6743bfb1d4e3f3fd
> > >> > > > > >>> shows
> > >> > > > > >>> the kind of flow I was thinking of. It is just a prototype
> > >> with a
> > >> > > > fixed
> > >> > > > > >>> re-authentication period to explore the possibility of
> > >> > implementing
> > >> > > > > >>> re-authentication similar to authentication. There will be
> > >> edge
> > >> > > cases
> > >> > > > > to
> > >> > > > > >>> cover and errors to handle, but hopefully the code makes
> the
> > >> > > approach
> > >> > > > > >>> clearer than my earlier explanation!
> > >> > > > > >>>
> > >> > > > > >>> So the differences in the two implementations as you have
> > >> already
> > >> > > > > >>> mentioned
> > >> > > > > >>> earlier.
> > >> > > > > >>>
> > >> > > > > >>>    1. Re-authentication sequences are not interleaved with
> > >> Kafka
> > >> > > > > >>> requests.
> > >> > > > > >>>    As you said, this has a higher impact on latency. IMO,
> > this
> > >> > also
> > >> > > > > >>> makes it
> > >> > > > > >>>    easier to debug, especially with complex protocols like
> > >> > Kerberos
> > >> > > > > >>> which are
> > >> > > > > >>>    notoriously difficult to diagnose.
> > >> > > > > >>>    2. Re-authentication failures will not be retried, they
> > >> will
> > >> > be
> > >> > > > > >>> treated
> > >> > > > > >>>    as fatal errors similar to authentication failures.
> IMO,
> > >> since
> > >> > > we
> > >> > > > > >>> rely on
> > >> > > > > >>>    brokers never rejecting valid authentication requests
> > >> (clients
> > >> > > > treat
> > >> > > > > >>>    authentication failures as fatal errors), it is ok to
> > fail
> > >> on
> > >> > > > > >>>    re-authentication failure as well.
> > >> > > > > >>>    3. Re-authentication is performed on the network thread
> > on
> > >> > > brokers
> > >> > > > > >>>    similar to authentication (in your implementation, I
> > think
> > >> it
> > >> > > > would
> > >> > > > > >>> be on
> > >> > > > > >>>    the request thread). IMO, it is better do all
> > >> authentications
> > >> > > > using
> > >> > > > > >>> the
> > >> > > > > >>>    same thread pool.
> > >> > > > > >>>    4. Code complexity: I don't think actual code
> complexity
> > >> would
> > >> > > be
> > >> > > > > very
> > >> > > > > >>>    different between the two implementations. But I think
> > >> there
> > >> > is
> > >> > > > > value
> > >> > > > > >>> in
> > >> > > > > >>>    keeping re-authentication code within existing
> > >> > network/security
> > >> > > > > >>> layers. The
> > >> > > > > >>>    number of classes modified will be smaller and the
> number
> > >> of
> > >> > > > > packages
> > >> > > > > >>>    modified even smaller.
> > >> > > > > >>>
> > >> > > > > >>> Anyway, let me know what you think:
> > >> > > > > >>>
> > >> > > > > >>>    - Will this approach work for your scenarios?
> > >> > > > > >>>    - Do we really need to handle re-authentication
> > differently
> > >> > from
> > >> > > > > >>>    authentication?
> > >> > > > > >>>
> > >> > > > > >>>
> > >> > > > > >>> On Thu, Sep 6, 2018 at 1:40 PM, Ron Dagostino <
> > >> [email protected]
> > >> > >
> > >> > > > > wrote:
> > >> > > > > >>>
> > >> > > > > >>> > Yeah, regarding ControllerChannelManager, it is one of
> the
> > >> > > > > synchronous
> > >> > > > > >>> I/O
> > >> > > > > >>> > use cases (along with 2 others: KafkaProducer, via
> Sender;
> > >> and
> > >> > > > > >>> > ReplicaFetcherBlockingSend, via ReplicaFetcherThread)
> > where
> > >> the
> > >> > > > > >>> assumption
> > >> > > > > >>> > is complete ownership of the connection. The PR's
> approach
> > >> to
> > >> > > > dealing
> > >> > > > > >>> with
> > >> > > > > >>> > that assumption is to inject the re-authentication
> > requests
> > >> > into
> > >> > > > the
> > >> > > > > >>> > owner's existing flow so that they are simply sent with
> a
> > >> > > callback
> > >> > > > > that
> > >> > > > > >>> > NetworkClient executes.  The alternative approach, which
> > is
> > >> > what
> > >> > > I
> > >> > > > > >>> think
> > >> > > > > >>> > you are investigating, is to allow the connection to be
> > >> > > temporarily
> > >> > > > > >>> taken
> > >> > > > > >>> > offline and having any attempt by
> ControllerChannelManager
> > >> (or
> > >> > > > other
> > >> > > > > >>> > synchronous use case owner) to use the connection while
> it
> > >> is
> > >> > in
> > >> > > > this
> > >> > > > > >>> > "offline" state result in some kind of pause until the
> > >> > connection
> > >> > > > > comes
> > >> > > > > >>> > back online.  One issue with this approach might be the
> > >> length
> > >> > of
> > >> > > > > time
> > >> > > > > >>> that
> > >> > > > > >>> > the connection is unavailable; will it be offline for
> all
> > >> > > > > >>> authentication
> > >> > > > > >>> > requests and responses (ApiVersionsRequest/Response,
> > >> > > > > >>> > SaslHandshakeRequest/Response, and
> > SaslAuthenticateRequest/
> > >> > > > > Response)?
> > >> > > > > >>> > Note
> > >> > > > > >>> > the last one could actually be invoked multiple times,
> so
> > >> there
> > >> > > > could
> > >> > > > > >>> be 4
> > >> > > > > >>> > or more round-trips before the authentication "dance" is
> > >> > > finished.
> > >> > > > > >>> Will
> > >> > > > > >>> > the connection be "offline" the entire time, or will it
> be
> > >> > placed
> > >> > > > > back
> > >> > > > > >>> > "online" in between each request/response pair to allow
> > the
> > >> > owner
> > >> > > > of
> > >> > > > > >>> the
> > >> > > > > >>> > connection to use it -- in which case the authentication
> > >> > process
> > >> > > > > would
> > >> > > > > >>> have
> > >> > > > > >>> > to wait to get ownership again?  The approach I took
> > >> > interleaves
> > >> > > > the
> > >> > > > > >>> > authentication requests/responses with whatever the
> owner
> > is
> > >> > > doing,
> > >> > > > > so
> > >> > > > > >>> it
> > >> > > > > >>> > is conceivable that use of the connection jumps back and
> > >> forth
> > >> > > > > between
> > >> > > > > >>> the
> > >> > > > > >>> > two purposes.  Such jumping back and forth minimizes any
> > >> added
> > >> > > > > latency
> > >> > > > > >>> due
> > >> > > > > >>> > to the re-authentication, of course.
> > >> > > > > >>> >
> > >> > > > > >>> > Anyway, I'll look forward to finding out what you are
> able
> > >> to
> > >> > > > > conclude.
> > >> > > > > >>> > Good luck :-)
> > >> > > > > >>> >
> > >> > > > > >>> > Ron
> > >> > > > > >>> >
> > >> > > > > >>> > On Thu, Sep 6, 2018 at 5:17 AM Rajini Sivaram <
> > >> > > > > [email protected]
> > >> > > > > >>> >
> > >> > > > > >>> > wrote:
> > >> > > > > >>> >
> > >> > > > > >>> > > Hi Ron,
> > >> > > > > >>> > >
> > >> > > > > >>> > > Disconnections on the broker-side: I think we should
> do
> > >> > > > > >>> disconnections
> > >> > > > > >>> > as a
> > >> > > > > >>> > > separate KIP and PR as you originally intended. But
> that
> > >> one
> > >> > > > could
> > >> > > > > be
> > >> > > > > >>> > done
> > >> > > > > >>> > > separately without requiring KIP-368 as a pre-req. As
> a
> > >> > simpler
> > >> > > > > >>> > > implementation and one that can be used without
> KIP-368
> > in
> > >> > some
> > >> > > > > >>> cases, we
> > >> > > > > >>> > > could commit that first since this one may take
> longer.
> > >> > > > > >>> > >
> > >> > > > > >>> > > I wasn't suggesting that we do move re-authentication
> to
> > >> > > > > >>> NetworkClient, I
> > >> > > > > >>> > > was thinking of the lower layers, handling
> > authentication
> > >> and
> > >> > > > > >>> > > reauthentication at the same layer in a similar way.
> Let
> > >> me
> > >> > > look
> > >> > > > > >>> into the
> > >> > > > > >>> > > code and come up with a more detailed explanation to
> > avoid
> > >> > > > > confusion.
> > >> > > > > >>> > >
> > >> > > > > >>> > > I am not too concerned about the imports in
> > KafkaChannel.
> > >> As
> > >> > > you
> > >> > > > > >>> said, we
> > >> > > > > >>> > > can improve that if we need to. KafkaChannels are
> aware
> > of
> > >> > > > > >>> > > network/authentication states and if that becomes a
> bit
> > >> more
> > >> > > > > >>> complex, I
> > >> > > > > >>> > > don't think it would matter so much. My concern is
> about
> > >> > > changes
> > >> > > > > like
> > >> > > > > >>> > >
> > >> > > > > >>> > > https://github.com/apache/kafka/pull/5582/files#diff-
> > >> > > > > >>> > 987fef43991384a3ebec5fb55e53b577
> > >> > > > > >>> > > in ControllerChannelManager. Those classes shouldn't
> > have
> > >> > deal
> > >> > > > with
> > >> > > > > >>> SASL
> > >> > > > > >>> > or
> > >> > > > > >>> > > reauthentication. Anyway, I will get back with more
> > >> detail on
> > >> > > > what
> > >> > > > > I
> > >> > > > > >>> had
> > >> > > > > >>> > in
> > >> > > > > >>> > > mind since that may not be viable at all.
> > >> > > > > >>> > >
> > >> > > > > >>> > >
> > >> > > > > >>> > >
> > >> > > > > >>> > > On Thu, Sep 6, 2018 at 1:44 AM, Ron Dagostino <
> > >> > > [email protected]
> > >> > > > >
> > >> > > > > >>> wrote:
> > >> > > > > >>> > >
> > >> > > > > >>> > > > I just thought of another alternative if the imports
> > are
> > >> > the
> > >> > > > > >>> concern.
> > >> > > > > >>> > > > KafkaChannel could expose the fact that it can
> create
> > an
> > >> > > > > additional
> > >> > > > > >>> > > > Authenticator instance on the side (what I referred
> to
> > >> as
> > >> > > > > >>> > > > notYetAuthenticatedAuthenticator in the PR) and it
> > could
> > >> > let
> > >> > > > > >>> > > > kafka.server.KafkaApis drive the whole thing --
> create
> > >> the
> > >> > > > > >>> instance on
> > >> > > > > >>> > > the
> > >> > > > > >>> > > > side, clean it up if it fails, move it into place
> and
> > >> close
> > >> > > the
> > >> > > > > >>> old one
> > >> > > > > >>> > > if
> > >> > > > > >>> > > > it succeeds, etc.  Then KafkaChannel wouldn't need
> to
> > >> > import
> > >> > > > > >>> anything
> > >> > > > > >>> > new
> > >> > > > > >>> > > > -- it just exposes its Authenticator and the ability
> > to
> > >> > > perform
> > >> > > > > the
> > >> > > > > >>> > swap
> > >> > > > > >>> > > > upon success, etc.
> > >> > > > > >>> > > >
> > >> > > > > >>> > > > Ron
> > >> > > > > >>> > > >
> > >> > > > > >>> > > > On Wed, Sep 5, 2018 at 5:01 PM Ron Dagostino <
> > >> > > > [email protected]>
> > >> > > > > >>> > wrote:
> > >> > > > > >>> > > >
> > >> > > > > >>> > > > > Hi again, Rajini,  I realized a couple of
> potential
> > >> > > concerns
> > >> > > > > with
> > >> > > > > >>> > using
> > >> > > > > >>> > > > > the TransportLayer directly during
> > re-authentication.
> > >> > > >  First,
> > >> > > > > >>> in the
> > >> > > > > >>> > > > > blocking I/O use case, the owner of the
> > NetworkClient
> > >> > > > instance
> > >> > > > > >>> calls
> > >> > > > > >>> > > > > NetworkClientUtils.sendAndReceive() to send
> > requests.
> > >> > >  This
> > >> > > > > >>> method
> > >> > > > > >>> > > > > assumes the caller has exclusive access to the
> > >> > > NetworkClient,
> > >> > > > > so
> > >> > > > > >>> it
> > >> > > > > >>> > > does
> > >> > > > > >>> > > > > not check to see if the node is ready; it just
> sends
> > >> the
> > >> > > > > request
> > >> > > > > >>> and
> > >> > > > > >>> > > > > repeatedly calls poll() until the response
> arrives.
> > >> if
> > >> > we
> > >> > > > were
> > >> > > > > >>> to
> > >> > > > > >>> > take
> > >> > > > > >>> > > > > the connection temporarily offline that method
> > >> currently
> > >> > > has
> > >> > > > no
> > >> > > > > >>> > > mechanism
> > >> > > > > >>> > > > > for checking for such a state before it sends the
> > >> > request;
> > >> > > we
> > >> > > > > >>> could
> > >> > > > > >>> > add
> > >> > > > > >>> > > > it,
> > >> > > > > >>> > > > > but we would have to put in some kind of sleep
> loop
> > to
> > >> > keep
> > >> > > > > >>> checking
> > >> > > > > >>> > > for
> > >> > > > > >>> > > > it
> > >> > > > > >>> > > > > to come back "online" before it could send.
> Adding
> > >> such
> > >> > a
> > >> > > > > sleep
> > >> > > > > >>> loop
> > >> > > > > >>> > > > could
> > >> > > > > >>> > > > > be done, of course, but it doesn't sound ideal.
> > >> > > > > >>> > > > >
> > >> > > > > >>> > > > > A similar sleep loop situation exists in the async
> > use
> > >> > > case.
> > >> > > > > The
> > >> > > > > >>> > owner
> > >> > > > > >>> > > > > repeatedly calls poll() in a loop, but if the
> > >> connection
> > >> > to
> > >> > > > the
> > >> > > > > >>> node
> > >> > > > > >>> > is
> > >> > > > > >>> > > > > temporarily offline then the poll() method would
> > have
> > >> to
> > >> > > > enter
> > >> > > > > a
> > >> > > > > >>> > sleep
> > >> > > > > >>> > > > > loop until either the connection comes back online
> > or
> > >> the
> > >> > > > > timeout
> > >> > > > > >>> > > elapses
> > >> > > > > >>> > > > > (whichever comes first).
> > >> > > > > >>> > > > >
> > >> > > > > >>> > > > > I don't know if there is an aversion to adding
> sleep
> > >> > loops
> > >> > > > like
> > >> > > > > >>> that,
> > >> > > > > >>> > > so
> > >> > > > > >>> > > > > maybe it isn't a big issue, but I wanted to raise
> it
> > >> as a
> > >> > > > > >>> potential
> > >> > > > > >>> > > > concern
> > >> > > > > >>> > > > > with this approach.
> > >> > > > > >>> > > > >
> > >> > > > > >>> > > > > Also, is the import of SASL-specific classes in
> > >> > > KafkaChannel
> > >> > > > a
> > >> > > > > >>> major
> > >> > > > > >>> > > > > objection to the current implementation?  I could
> > >> > eliminate
> > >> > > > > that
> > >> > > > > >>> by
> > >> > > > > >>> > > > > replacing the 2 offending methods in KafkaChannel
> > with
> > >> > this
> > >> > > > one
> > >> > > > > >>> and
> > >> > > > > >>> > > > > having the implementation delegate to the
> > >> authenticator:
> > >> > > > > >>> > > > >
> > >> > > > > >>> > > > >     /**
> > >> > > > > >>> > > > >      * Respond to a re-authentication request.
> This
> > >> > occurs
> > >> > > on
> > >> > > > > the
> > >> > > > > >>> > > > >      * Server side of the re-authentication dance
> > >> (i.e.
> > >> > on
> > >> > > > the
> > >> > > > > >>> > broker).
> > >> > > > > >>> > > > >      *
> > >> > > > > >>> > > > >      * @param requestHeader
> > >> > > > > >>> > > > >      *            the request header
> > >> > > > > >>> > > > >      * @param request
> > >> > > > > >>> > > > >      *            the request to process
> > >> > > > > >>> > > > >      * @return the response to return to the
> client
> > >> > > > > >>> > > > >      */
> > >> > > > > >>> > > > >     public AbstractResponse
> > >> > respondToReauthenticationReque
> > >> > > > > >>> > > > st(RequestHeader
> > >> > > > > >>> > > > > requestHeader,
> > >> > > > > >>> > > > >             AbstractRequest request)
> > >> > > > > >>> > > > >
> > >> > > > > >>> > > > > There is practically no work being done in the
> > >> > KafkaChannel
> > >> > > > > >>> instance
> > >> > > > > >>> > > > > anyway -- it does some sanity checking but
> otherwise
> > >> > > > delegates
> > >> > > > > >>> to the
> > >> > > > > >>> > > > > authenticator.  We could just add a method to the
> > >> > > > Authenticator
> > >> > > > > >>> > > interface
> > >> > > > > >>> > > > > and delegate the whole thing.
> > >> > > > > >>> > > > >
> > >> > > > > >>> > > > > Ron
> > >> > > > > >>> > > > >
> > >> > > > > >>> > > > > On Wed, Sep 5, 2018 at 2:07 PM Ron Dagostino <
> > >> > > > > [email protected]>
> > >> > > > > >>> > > wrote:
> > >> > > > > >>> > > > >
> > >> > > > > >>> > > > >> Hi Rajini.  I'm now skeptical of my
> > "ConnectionState.
> > >> > > > > >>> > REAUTHENTICATING"
> > >> > > > > >>> > > > idea.
> > >> > > > > >>> > > > >> The concept of a connection being "READY" or not
> > can
> > >> > > impact
> > >> > > > > >>> > > > >> ConsumerCoordinator (see, for example,
> > >> > > > > >>> > > > >>
> > >> https://github.com/apache/kafka/blob/trunk/clients/src/
> > >> > > > > >>> > > > main/java/org/apache/kafka/cli
> > ents/consumer/internals/
> > >> > > > > >>> > > > ConsumerCoordinator.java#L352).
> > >> > > > > >>> > > > >> The semantics of a connection being "READY" and
> the
> > >> > > > > >>> > > > >> implications/assumptions are not clear, and I
> > suspect
> > >> > > there
> > >> > > > > >>> will be
> > >> > > > > >>> > > some
> > >> > > > > >>> > > > >> unintended consequences of this approach that may
> > >> not be
> > >> > > > > >>> immediately
> > >> > > > > >>> > > > >> apparent.
> > >> > > > > >>> > > > >>
> > >> > > > > >>> > > > >> I will confess that when I was implementing
> > >> > > > re-authentication
> > >> > > > > I
> > >> > > > > >>> > > thought
> > >> > > > > >>> > > > >> it might be worthwhile to unify the
> > >> > authentication-related
> > >> > > > > code
> > >> > > > > >>> > bases
> > >> > > > > >>> > > --
> > >> > > > > >>> > > > >> except I suspected it would be good to go in the
> > >> other
> > >> > > > > >>> direction:
> > >> > > > > >>> > have
> > >> > > > > >>> > > > the
> > >> > > > > >>> > > > >> current code that directly uses the
> TransportLayer
> > >> > instead
> > >> > > > use
> > >> > > > > >>> > > > >> AuthenticationSuccessOrFailureReceiver and
> > >> > > > > >>> > > > AuthenticationRequestEnqueuer.
> > >> > > > > >>> > > > >> I'm not advocating that we do it -- I decided to
> > not
> > >> go
> > >> > > > there
> > >> > > > > >>> when
> > >> > > > > >>> > > > creating
> > >> > > > > >>> > > > >> the PR, after all -- but I did get a strong
> feeling
> > >> that
> > >> > > > > >>> directly
> > >> > > > > >>> > > using
> > >> > > > > >>> > > > the
> > >> > > > > >>> > > > >> TransportLayer as is currently done is really
> only
> > >> > viable
> > >> > > > > before
> > >> > > > > >>> > > anybody
> > >> > > > > >>> > > > >> else starts using the connection.  If we want to
> > use
> > >> the
> > >> > > > > >>> > > TransportLayer
> > >> > > > > >>> > > > again
> > >> > > > > >>> > > > >> after that point then it is up to us to somehow
> > take
> > >> the
> > >> > > > > >>> connection
> > >> > > > > >>> > > > >> "temporarily offline" so that we have exclusive
> > >> rights
> > >> > to
> > >> > > it
> > >> > > > > >>> again,
> > >> > > > > >>> > > and
> > >> > > > > >>> > > > I
> > >> > > > > >>> > > > >> wonder if the concept of a connection being
> > >> "temporarily
> > >> > > > > >>> offline" is
> > >> > > > > >>> > > > >> something the existing code is able to handle --
> > >> > probably
> > >> > > > not,
> > >> > > > > >>> and I
> > >> > > > > >>> > > > >> suspect there are unstated assumptions that would
> > be
> > >> > > > > >>> invalidated.
> > >> > > > > >>> > > > >>
> > >> > > > > >>> > > > >> Do you think this particular "ConnectionState.
> > >> > > > > REAUTHENTICATING"
> > >> > > > > >>> > idea
> > >> > > > > >>> > > is
> > >> > > > > >>> > > > >> worth pursuing?  How about the general idea of
> > >> trying to
> > >> > > use
> > >> > > > > the
> > >> > > > > >>> > > > >> TransportLayer directly -- are you still feeling
> > >> like it
> > >> > > is
> > >> > > > > >>> viable?
> > >> > > > > >>> > > > >>
> > >> > > > > >>> > > > >> Ron
> > >> > > > > >>> > > > >>
> > >> > > > > >>> > > > >>
> > >> > > > > >>> > > > >>
> > >> > > > > >>> > > > >> Ron
> > >> > > > > >>> > > > >>
> > >> > > > > >>> > > > >> On Wed, Sep 5, 2018 at 11:40 AM Ron Dagostino <
> > >> > > > > >>> [email protected]>
> > >> > > > > >>> > > > wrote:
> > >> > > > > >>> > > > >>
> > >> > > > > >>> > > > >>> <<<in favor of implementing server-side kill in
> > >> > addition
> > >> > > to
> > >> > > > > >>> > > > >>> re-authentication, not as a replacement.
> > >> > > > > >>> > > > >>> <<<I think Rajini suggested the same thing
> > >> > > > > >>> > > > >>> Oh, ok, I misunderstood.  Then I think we are on
> > the
> > >> > same
> > >> > > > > >>> page: if
> > >> > > > > >>> > we
> > >> > > > > >>> > > > >>> are going to do re-authentication, then we
> should
> > >> also
> > >> > do
> > >> > > > > >>> > > > >>> server-side-kill-upon-expiration as part of the
> > same
> > >> > > > > >>> > implementation.
> > >> > > > > >>> > > > I'm
> > >> > > > > >>> > > > >>> good with that.
> > >> > > > > >>> > > > >>>
> > >> > > > > >>> > > > >>> I am also looking into Rajini's idea of doing
> > >> > > > > >>> re-authentication at
> > >> > > > > >>> > > the
> > >> > > > > >>> > > > >>> NetworkClient level and reusing the existing
> > >> > > authentication
> > >> > > > > >>> code
> > >> > > > > >>> > > > path.  I
> > >> > > > > >>> > > > >>> was skeptical when she suggested it, but now
> that
> > I
> > >> > look
> > >> > > > > >>> closer I
> > >> > > > > >>> > see
> > >> > > > > >>> > > > >>> something that I can try.  NetworkClient has
> logic
> > >> to
> > >> > > > > >>> recognize the
> > >> > > > > >>> > > > >>> state "ConnectionState.CONNECTING" as meaning
> "you
> > >> > can't
> > >> > > do
> > >> > > > > >>> > anything
> > >> > > > > >>> > > > >>> with this connection at the moment; please
> wait."
> > >> I'm
> > >> > > > going
> > >> > > > > >>> to try
> > >> > > > > >>> > > > adding
> > >> > > > > >>> > > > >>> a new state "ConnectionState.REAUTHENTICATING"
> > that
> > >> > would
> > >> > > > be
> > >> > > > > >>> > > > recognized
> > >> > > > > >>> > > > >>> in a similar way.  Then the challenge becomes
> > >> inserting
> > >> > > > > myself
> > >> > > > > >>> into
> > >> > > > > >>> > > any
> > >> > > > > >>> > > > >>> existing flow that might be going on.  I'll
> > probably
> > >> > add
> > >> > > > the
> > >> > > > > >>> > request
> > >> > > > > >>> > > > to set
> > >> > > > > >>> > > > >>> the state to "REAUTHENTICATING" to a queue if I
> > >> can't
> > >> > > grab
> > >> > > > > the
> > >> > > > > >>> > state
> > >> > > > > >>> > > > >>> immediately and have the network client's poll()
> > >> method
> > >> > > > check
> > >> > > > > >>> at
> > >> > > > > >>> > the
> > >> > > > > >>> > > > >>> end to see if any such requests can be granted;
> > >> there
> > >> > > would
> > >> > > > > be
> > >> > > > > >>> a
> > >> > > > > >>> > > > callback
> > >> > > > > >>> > > > >>> associated with the request, and that way I can
> be
> > >> > > assured
> > >> > > > I
> > >> > > > > >>> would
> > >> > > > > >>> > be
> > >> > > > > >>> > > > >>> granted the request in a reasonable amount of
> time
> > >> > > > (assuming
> > >> > > > > >>> the
> > >> > > > > >>> > > > connection
> > >> > > > > >>> > > > >>> doesn't close in the meantime).  Then it would
> be
> > up
> > >> > the
> > >> > > > > >>> callback
> > >> > > > > >>> > > > >>> implementation to perform the re-authentication
> > >> dance
> > >> > and
> > >> > > > set
> > >> > > > > >>> the
> > >> > > > > >>> > > state
> > >> > > > > >>> > > > >>> back to "ConnectionState.READY".  I don't know
> if
> > >> this
> > >> > > will
> > >> > > > > >>> work,
> > >> > > > > >>> > and
> > >> > > > > >>> > > > >>> I'm probably missing some subtleties at the
> > moment,
> > >> but
> > >> > > > I'll
> > >> > > > > >>> give
> > >> > > > > >>> > it
> > >> > > > > >>> > > a
> > >> > > > > >>> > > > shot
> > >> > > > > >>> > > > >>> and see what happens.
> > >> > > > > >>> > > > >>>
> > >> > > > > >>> > > > >>> Ron
> > >> > > > > >>> > > > >>>
> > >> > > > > >>> > > > >>> On Wed, Sep 5, 2018 at 11:23 AM Colin McCabe <
> > >> > > > > >>> [email protected]>
> > >> > > > > >>> > > > wrote:
> > >> > > > > >>> > > > >>>
> > >> > > > > >>> > > > >>>> On Wed, Sep 5, 2018, at 07:34, Ron Dagostino
> > wrote:
> > >> > > > > >>> > > > >>>> > I added a "How To Support Re-Authentication
> for
> > >> > Other
> > >> > > > SASL
> > >> > > > > >>> > > > Mechanisms"
> > >> > > > > >>> > > > >>>> > section to the KIP as Rajini suggested.  I
> also
> > >> > added
> > >> > > a
> > >> > > > > >>> > "Rejected
> > >> > > > > >>> > > > >>>> > Alternative" for the idea of forcibly closing
> > >> > > > connections
> > >> > > > > >>> on the
> > >> > > > > >>> > > > >>>> client
> > >> > > > > >>> > > > >>>> > side upon token refresh or on the server side
> > >> upon
> > >> > > token
> > >> > > > > >>> > > expiration.
> > >> > > > > >>> > > > >>>> It
> > >> > > > > >>> > > > >>>> > may be a bit premature to reject the
> > server-side
> > >> > kill
> > >> > > > > >>> scenario
> > >> > > > > >>> > > given
> > >> > > > > >>> > > > >>>> that
> > >> > > > > >>> > > > >>>> > Colin and Rajini are partial to it, but see
> > below
> > >> > for
> > >> > > > > what I
> > >>

Re: [DISCUSS] KIP 368: Allow SASL Connections to Periodically Re-Authenticate

Reply via email to