Hi all,

Thank you so much for the generous review comments! Happy to see interest
in this feature. Inline responses follow.


Ashwin:

> Don't you foresee a future scope type which may require cluster metadata ?
In that case, isn't it better to forward the requests to the leader in the
initial implementation ?

I agree with Yash here: we can cross that bridge when we come to it, unless
there are problems that we'd encounter then that would be better addressed
by adding request forwarding now. One potential argument I can think of is
that the UX would be a little strange if the semantics for this endpoint
differ depending on the value of the scope parameter (some values would be
affected by in-progress rebalances, and some would not), but I don't know
if making scope=cluster more brittle in the name of consistency with, e.g.,
scope=connectorType:foo is really a worthwhile tradeoff. Thoughts?

> I would also recommend an additional system test for Standalone herder to
ensure that the new scope parameter is honored and the response contains
the last modified time.

Ah, great call! I love the new testing plan section. I also share Yash's
concerns about adding a new system test though (at this point, they're so
painful to write, test, and debug that in most circumstances I consider
them a last resort). Do you think it'd be reasonable to add end-to-end
verification for this with an integration test instead?


Yash:

> From the proposed changes section, it isn't very clear to me how we'll be
tracking this last modified timestamp to be returned in responses for the
*GET /admin/loggers* and *GET /admin/loggers/{logger}* endpoints. Could you
please elaborate on this? Also, will we track the last modified timestamp
even for worker scoped modifications where we won't write any records to
the config topic and the requests will essentially be processed
synchronously?

RE timestamp tracking: I was thinking we'd store the timestamp for each
namespace in memory and, whenever we change the level for a namespace,
update its timestamp to the current wallclock time. Concretely, this means
we'd want the timestamp for some logger `logger` to be captured as soon as
possible after the call to `logger.setLevel(level)` for some level `level`.
I'm honestly unsure how to clarify this further in the KIP; is there
anything in there that strikes you as particularly ambiguous that we can
tweak to be more clear?
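
In the meantime, here's a very rough sketch of the kind of bookkeeping I
have in mind. None of the class or method names here are part of the KIP,
and I'm assuming the Log4j 1.x API that the worker's logging resource
currently uses:

```java
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Purely illustrative; not the actual class proposed in the KIP
public class LoggerLevelTracker {

    // Maps logger namespace to the wallclock time (ms) of its last dynamic level change
    private final Map<String, Long> lastModifiedTimes = new ConcurrentHashMap<>();

    public void setLevel(String namespace, String levelName) {
        Logger logger = Logger.getLogger(namespace);
        Level level = Level.toLevel(levelName);
        if (level.equals(logger.getLevel())) {
            // No actual change to the namespace's level, so leave its timestamp untouched
            // (this also covers the idempotent-PUT case Federico raises below)
            return;
        }
        logger.setLevel(level);
        // Record the timestamp as soon as possible after the level change takes effect
        lastModifiedTimes.put(namespace, System.currentTimeMillis());
    }

    public Long lastModified(String namespace) {
        return lastModifiedTimes.get(namespace);
    }
}
```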

RE scope distinction for timestamps: I've updated the KIP to clarify this
point by adding this sentence: "Timestamps will be updated regardless of
whether the namespace update was applied using scope=worker or
scope=cluster." Let me know what you think!

> In the current synchronous implementation for the *PUT
/admin/loggers/{logger}* endpoint, we return a 404 error if the level is
invalid (i.e. not one of TRACE, DEBUG, WARN etc.). Since the new cluster
scoped variant of the endpoint will be asynchronous, can we also add a
validation to synchronously surface erroneous log levels to users?

Good call! I don't think we have to be explicit about this in the proposed
changes section, but it's a great fit for the testing plan, where I've
added it: "Ensure that cluster-scoped requests with invalid logging levels
are rejected with a 404 response".
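
Just to illustrate what I'm picturing for that check (the helper class and
the JAX-RS exception here are assumptions on my part, not anything spelled
out in the KIP):

```java
import javax.ws.rs.NotFoundException;

import org.apache.log4j.Level;

// Sketch only: surface invalid levels synchronously, before a cluster-scoped
// request is ever written to the config topic, so they still fail with a 404
// just like they do today for the worker-scoped endpoint
public class LogLevelValidator {

    public static Level validate(String levelName) {
        // Pass null as the default so unrecognized names aren't silently mapped to a real level
        Level level = Level.toLevel(levelName, null);
        if (level == null) {
            throw new NotFoundException("Invalid log level: " + levelName);
        }
        return level;
    }
}
```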

> I'm curious to know what the rationale here is? In KIP-745, the stated
reasoning behind ignoring restart requests during worker startup was that
the worker will anyway be starting all connectors and tasks assigned to it
so the restart request is essentially meaningless. With the log level API
however, wouldn't it make more sense to apply any cluster scoped
modifications to new workers in the cluster too? This also applies to any
workers that are restarted soon after a request is made to *PUT
/admin/loggers/{logger}?scope=cluster* on another worker. Maybe we could
stack up all the cluster scoped log level modification requests during the
config topic read at worker startup and apply the latest ones per namespace
(assuming older records haven't already been compacted) after we've
finished reading to the end of the config topic?

It's definitely possible to implement persistent cluster-wide logging level
adjustments, with the caveat that none would be applied until the worker
has finished reading to the end of the config topic. The
rationale for keeping these updates ephemeral is to continue to give
priority to workers' Log4j configuration files, with the underlying
philosophy that this endpoint is still only intended for debugging
purposes, as opposed to cluster-wide configuration. Permanent changes can
already be made by tweaking the Log4j file for a worker and then restarting
it. If a restart is too expensive for a permanent change, then the change
can be applied immediately via the REST API, and staged via the Log4j
configuration file (which will then be used the next time the worker is
restarted, whenever that happens).

We could potentially explore persistent cluster-wide logging adjustments in
the future (or, if people feel strongly, right now), but it'd be more
complex. How would we handle the case where a user has updated their Log4j
configuration file but continues to see the same logging level for a
namespace, in a way that greatly reduces or even eliminates the odds of
headaches or footguns? How would we allow users to reset or undo
cluster-wide configuration updates, in order to revert to whatever's
specified in their Log4j configuration file? Would we want to give users
control over whether a dynamic update is persistent or ephemeral? What
would the implications be for finer-grained scoping (especially with
regards to resetting dynamic changes)? Overall it doesn't seem worth the
work to tackle all of this right now, as long as we're not painting
ourselves into a corner. But as always, happy to hear what you and others
think!


Federico:

> Due to the idempotent nature of PUT, I guess that the last_modified
timestamp won't change if the same request is repeated successively.
Should we add a unit test for that?

Great call! I've added this to the testing plan with unit and end-to-end
tests (seems fairly cheap to do both; LMK if e2e feels like overkill
though).
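
For the unit test, I'm imagining something along these lines, building on
the hypothetical tracker sketch from earlier in this email (all names are
still placeholders):

```java
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;

// Relies on the illustrative LoggerLevelTracker sketch above
public class LoggerLevelIdempotencyTest {

    @Test
    public void repeatedPutWithSameLevelShouldNotChangeLastModified() {
        LoggerLevelTracker tracker = new LoggerLevelTracker();

        tracker.setLevel("org.apache.kafka.connect", "DEBUG");
        Long firstTimestamp = tracker.lastModified("org.apache.kafka.connect");
        assertNotNull(firstTimestamp);

        // Repeating the exact same request should be a no-op for the timestamp
        tracker.setLevel("org.apache.kafka.connect", "DEBUG");
        assertEquals(firstTimestamp, tracker.lastModified("org.apache.kafka.connect"));
    }
}
```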


Thanks again for all of your thoughts!

Cheers,

Chris

On Mon, Sep 4, 2023 at 2:53 AM Federico Valeri <fedeval...@gmail.com> wrote:

> Hi Chris, thanks. This looks like a useful feature.
>
> Due to the idempotent nature of PUT, I guess that the last_modified
> timestamp won't change if the same request is repeated successively.
> Should we add a unit test for that?
>
> On Mon, Sep 4, 2023 at 6:17 AM Ashwin <apan...@confluent.io.invalid>
> wrote:
> >
> > Hi Chris,
> >
> > Thanks for thinking about this useful feature !
> > I had a question regarding
> >
> > > Since cluster metadata is not required to handle these types of
> > > request, they will not be forwarded to the leader
> >
> > And later, we also mention about supporting more scope types in the
> > future. Don't you foresee a future scope type which may require cluster
> > metadata ? In that case, isn't it better to forward the requests to the
> > leader in the initial implementation ?
> >
> > I would also recommend an additional system test for Standalone herder to
> > ensure that the new scope parameter is honored and the response contains
> > the last modified time.
> >
> > Thanks,
> > Ashwin
> >
> > On Sat, Sep 2, 2023 at 5:12 AM Chris Egerton <chr...@aiven.io.invalid>
> > wrote:
> >
> > > Hi all,
> > >
> > > Can't imagine a worse time to publish a new KIP (it's late on a Friday
> > > and we're in the middle of the 3.6.0 release), but I wanted to put
> > > forth KIP-976 for discussion:
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-976%3A+Cluster-wide+dynamic+log+adjustment+for+Kafka+Connect
> > >
> > > TL;DR: The API to dynamically adjust log levels at runtime with
> > > Connect is great, and I'd like to augment it with support to adjust
> > > log levels for every worker in the cluster (instead of just the worker
> > > that receives the REST request).
> > >
> > > I look forward to everyone's thoughts, but expect that this will
> > > probably take a bump or two once the dust has settled on 3.6.0. Huge
> > > thanks to everyone that's contributed to that release so far,
> > > especially our release manager Satish!
> > >
> > > Cheers,
> > >
> > > Chris
> > >
>
