Re: [DISCUSS] KIP-663: API to Start and Shut Down Stream Threads and to Request Closing of Kafka Streams Clients

Bruno Cadonna Thu, 17 Sep 2020 09:34:26 -0700

Hi Sophie,

Thank you for the feedback! I replied inline.


Best,
Bruno

On 16.09.20 19:19, Sophie Blee-Goldman wrote:


We guarantee that the metadata of the dead stream threads  will be
returned by KafkaStreams#localThreadsMetadata() at least until the next
call to KafkaStreams#addStreamThread() or
KafkaStreams#removeStreamThread() after the stream thread transited to
DEAD



This seems kind of tricky...personally I would find it pretty odd if I
queried the
local thread metadata and found two threads, A (alive) and B (dead), and
then
called removeStreamThread() and now suddenly I have zero. Or if I call
addStreamThread and now I still have two threads.

The behavior might be unusual, but it is well defined and not random byany means.

Both of those results seem to indicate that only live threads "count" and
are returned
by localThreadsMetadata(). But in reality we do temporarily keep the dead
thread,
but only for the arbitrary amount of time until the next time you want to
add or
remove some other stream thread? That seems like a weird side effect of the
add/removeStreamThread APIs.

This is not a side effect that just happens to occur. This is aguarantee that users get. It gives users the possibility to retrieve themetadata of the dead stream threads since the last call toadd/removeStreamThread. Admittedly, this guarantee overlap with thecurrent/planned implementation. But that is more a coincidence.

I would be more concerned about when add/removeStreamThread is calledfrom different threads which could happen if an uncaught exceptionhandler is called that wants to replace a stream thread and a threadthat is responsible for automated scaling up is running.

If we really think users might want to log the metadata of dead threads,
then
let's just do that for them or give them a way to do exactly that.

Logging the metatdata of dead stream threads for the user is a validalternative. Giving users the way to do exactly that is hard because theStreamThread class is not part of the public API. They would always needto call a method on the KafkaStreams object where we already havelocalThreadsMetadata().

I'm not that concerned about the backwards compatibility of removing dead
threads from the localThreadsMetadata, because I find it hard to believe
that
users do anything other than just skip over them in the list (set?) that
gets
returned. But maybe someone can chime in with an example use case.

I am also not too much concerned about backwards compatibility. Thatwould indeed be a side effect of the current proposal.

I'm actually even a little skeptical that any users might want to log the
metadata of a
dead thread, since all of the metadata is only useful for IQ on live
threads or
already covered by other easily discoverable logging elsewhere, or both.

Said all of the above, I actually agree with you that there is not thatmuch information in the metadata of a dead stream thread that isinteresting. The name of the stream thread is known in the uncaughtexception handler. The names of the clients, like consumer etc., used bythe stream thread can be derived from the name of the stream thread.Finally, the sets of active and standby tasks should be empty for a deadstream thread.

Hence, I backpedal and propose to filter out dead stream threads fromlocalThreadsMetadata(). WDYT?

On Wed, Sep 16, 2020 at 2:07 AM Bruno Cadonna <br...@confluent.io> wrote:

Hi again,

I just realized that if we filter out DEAD stream threads in
localThreadsMetadata(), users cannot log the metadata of dying stream
threads in the uncaught exception handler.

I realized this thanks to the example Guozhang requested in the KIP.
Thank you for that, Guozhang!

Hence, I adapted the KIP as follows:

- We do not filter out DEAD stream threads in
KafkaStreams#localThreadsMetadata()

- We guarantee that the metadata of the dead stream threads  will be
returned by KafkaStreams#localThreadsMetadata() at least until the next
call to KafkaStreams#addStreamThread() or
KafkaStreams#removeStreamThread() after the stream thread transited to
DEAD. Besides giving users the opportunity to log the metadata of a
dying stream thread in its uncaught exception handler, this guarantee
makes KafkaStreams#localThreadsMetadata() completely backward compatible
to the current behavior, because if KafkaStreams#addStreamThread() and
KafkaStreams#removeStreamThread() are never called,
KafkaStreams#localThreadsMetadata() will also return the metadata of all
streams threads that have ever died which corresponds to the current
behavior.

- We guarantee that dead stream threads are removed from a Kafka Streams
client at latest after the next call to KafkaStreams#addStreamThread()
or KafkaStreams#removeStreamThread() following the transition of the
stream thread to DEAD. This guarantees that the number of maintained
stream threads does not grow indefinitely.


Best,
Bruno



On 16.09.20 09:23, Bruno Cadonna wrote:

Hi Guozhang,

Good point! I would propose to filter out DEAD stream threads in
localThreadsMetadata() to get consistent results that do not depend on
timing. I will update the KIP accordingly.

Best,
Bruno

On 16.09.20 06:02, Guozhang Wang wrote:

Thanks Bruno, your replies make sense to me. As for
localThreadsMetadata() itself,
I'd like to clarify if it would return any still-bookkept threads, or
would
it specifically filter out those DEAD threads even if they are not yet
removed.

Otherwise, the KIP LGTM.

Guozhang

On Tue, Sep 15, 2020 at 2:58 AM Bruno Cadonna <br...@confluent.io>

wrote:

Hi Guozhang,

Thank you for your feedback. I replied inline.

Best,
Bruno

On 09.09.20 23:43, Guozhang Wang wrote:

Hello Bruno,

Finally got some time to review your KIP and the discussion thread
now..

few comments below:

1) I'm with Matthias about the newly added numberOfAliveStreamThreads

v.s.

existing localThreadsMetadata: to me it seems we can always achieve

the

first based on the second. It seems not worthy to provide some "syntax
sugar" to the API but just let users do the filtering themselves.


I am not married to that method. I removed it.

Furthermore, I'm wondering what's the rationale behind removing the
DEAD
threads from localThreadsMetadata()? Personally I feel returning all
threads, including those who are ever closed, either due to
exception or
due to removeStreamThread, would be beneficial for debugging
purposes, as
long as within the lifetime of an instance we expect the amount of

such

dead threads will not increase linearly --- and if we agree with that,
maybe we can rename "removeStreamThread" to sth. like
"terminateStreamThread" indicating it is only terminated but not
removed
--- and of course if users do not want to see those DEAD threads
they can
always filter them out. I'm just proposing that we should still
leave the
door open for those who want to check those ever terminated threads.


I actually think the number of dead stream threads might increase
linearly. Assume users have a systematic error that continuously kills

stream thread and they blindly start a new stream thread in the

uncaught

exception handler. This scenario might be a mistake but if the
systematic error does not occur at a high rate, it could also be a
strategy to keep the application running during the investigation of

the

systematic error.

IMO, removing dead stream threads makes Kafka Streams more robust
because it prevent a possibly unbounded increase of memory usage. If
users want to debug the dead stream threads they can monitor the number
of dead threads with the metric proposed in the KIP and they could
additionally log the metadata of the dying stream thread in the

uncaught

exception handler. I do not think that there is need to keep dead

stream

threads around.

2) I think it would help to write down some example user code in

exception

handler e.g. to illustrate how this would be implemented -- e.g. we
know
that practically the handler need to maintain a "this" reference of

the

instance anyways in order to shutdown the whole instance or,

add/terminate

threads dynamically, but I want to see if we have listed all possible

call

paths like: a) a thread's handler logic to terminate another thread,
b) a
thread handler to add new threads, etc are all appropriately supported
without deadlocks.


I added an example for an uncaught exception handler that adds a stream
thread to the KIP. Removing a stream thread in an uncaught exception
handler doesn't seem a common use case to me. Nevertheless, we need to
make sure that we do not run in a deadlock in that case. I will

consider

that during the implementation and write tests to check for deadlocks.

Shutting down the Kafka Streams client from inside an uncaught

exception

handler is outside the scope of this KIP. In the beginning it was part
of the KIP, but during the discussion it turned out that we can fix our
existing close() method to accomplish the shutdown from inside an
uncaught exception handler. But I completely agree with you that we

need

to ensure that we do not run into a deadlock in this case.


Guozhang


On Wed, Sep 9, 2020 at 11:35 AM Matthias J. Sax <mj...@apache.org>

wrote:

I would prefer to not add a new method. It seems unnecessary.
`localThreadMetadata` does return all threads in all states(*) and
thus
provides full insight.

(*) A thread in state DEAD could be returned as long as it's not
removed
yet.

I don't see any advantage to pre-filter threads and to exclude

threads

in state CREATE or PENDING_SHUTDOWN. Even if a CREATED thread is not
started yet, it is still "alive" in a broader sense. For example, if

user wants to scale out to 10 thread, and 8 are RUNNING and 2 are in
state CREATED, a user won't need to add 2 more threads -- there are
already 10 threads.

For PENDING_SHUTDOWN and scale in it would be different I guess, as
the
proposal would be to filter them out right away. However, filtering
them
seems actually not to be "correct", as a thread in PENDING_SHUTDOWN
might still process data and it's thus still "alive".

If there is still a need later to add a new method about "alive
thread"
we can always add as a follow up -- removing things is much harder.

I also don't think that there is value in returning names of dead
threads, as we recycle names.


-Matthias


On 9/9/20 10:04 AM, Sophie Blee-Goldman wrote:

I agree that the current behavior of localThreadsMetadata() does not

seem

to match, but it seems like we will be forced to change it to only

return

currently-alive threads. For one thing, we plan to recycle old

thread

names.

It would be pretty confusing for a user to get two (or more)

ThreadMetadata

objects returned with the same name, since AFAICT this is the only
distinguishing identifier of stream threads. I think we should
enforce

that

only live threads are returned by localThreadsMetadata(). Plus, as

Matthias

pointed out, we plan to remove dead threads from the KafkaStreams

client,

so still returning them in the metadata would be extremely odd.

If we think that there might be some use case that requires knowing

which

threads have died, we could consider adding a method that returns

the

names of dead threads. But the only use case I can imagine would

probably

be better served by a callback that gets invoked when the thread
dies,

which

we already have.

On Tue, Sep 8, 2020 at 11:46 PM Bruno Cadonna <br...@confluent.io>

wrote:

Hi Matthias and Sophie,

I agree that localThreadsMetadata() can be used here. However,
localThreadsMetadata() returns all stream threads irrespectively of
their states. Alive stream threads are specified as being in one of

the

following states: RUNNING, STARTING, PARTITIONS_REVOKED, and
PARTITIONS_ASSIGNED. Hence, users would need to filter the result

of

localThreadsMetadata(). I thought, it would be neat to have a

method

that hides this filtering and returns the number of alive stream
threads, because that is the most basic information you might
need to
decide about adding or removing stream threads. For all more
advanced
use cases users should use localThreadsMetadata(). I am also happy

with

removing the method. WDYT?

Best,
Bruno

On 09.09.20 03:51, Matthias J. Sax wrote:

Currently we, don't cleanup dead threads, but the KIP proposes to

change

this:

Stream threads that are in state DEAD will be removed from the

stream

threads of a Kafka Streams client.



-Matthias

On 9/8/20 2:37 PM, Sophie Blee-Goldman wrote:

Ah, I forgot about localThreadsMetadata(). In that. case I agree,

there's

no reason
to introduce a new method when we can get both the names and
number

of

all

running threads from this.

I assume that we would update localThreadsMetadata to only return

currently

alive threads as part of this KIP -- at a quick glance, it seems

like

we

don't do
any pruning of dead threads at the moment

On Tue, Sep 8, 2020 at 1:58 PM Matthias J. Sax <mj...@apache.org

wrote:

I am not sure if we need a new method? There is already
`localThreadsMetadata()`. What do we gain by adding a new one?

Returning the thread's name (as `Optional<String>`) for both
add()

and

remove() is fine with me.


-Matthias

On 9/8/20 12:58 PM, Sophie Blee-Goldman wrote:

Sorry Bruno, I think I missed the end of your message with the
numberOfAliveStreamThreads()
proposal. I agree, that would be better than the alternatives I

listed.

That said:

They rather suggest that the method returns a list of
handles to

the

stream threads.

I hadn't thought of that originally, but now that you mention
it,

this

might be a good idea.
I don't think we should return actual handles on the threads,
but

maybe a

list of the thread
names rather than a single number of currently alive threads.

Since we seem to think it would be difficult if not
impossible to

keep

track of the number
of running stream threads, we should apply the same reasoning

to

the

names

and not
assume the user can/will keep track of every thread returned by
addStreamThread() or
removeStreamThread(). Users should generally take any required

action

immediately
after adding/removing the thread -- eg deregistering the thread

metrics

--

but it might
still be useful to provide a convenience method listing all
of the

current

threads

And of course you could still get the number of threads
easily by

invoking

size() on the
returned list (or ordered set?).

On Tue, Sep 8, 2020 at 12:16 PM Bruno Cadonna
<br...@confluent.io

wrote:

Thank you again for the feedback Sophie!

As I tried to point out in my previous e-mail, removing a
stream

thread

from a Kafka Streams client that does not have alive stream

threads

is

nothing exceptional for the client per se. However, it can
become
exceptional within the context of the user. For example, if
users

want

to remove a stream thread from a client without alive stream

threads

because one if their metrics say so, then this is
exceptional in

the

context of that user metric not in the context of the Kafka

Streams

client. In that case, users should throw an exception and
handle

it.


Regarding returning null, I do not like to return null because

from a

development point of view there is no distinction between

returning

null

because we have a bug in the code or returning null because
there

are no

alive stream threads. Additionally, Optional<String> makes it

more

explicit that the result could also be empty.

Thank you for the alternative method names! However, with the

names

you

propose it is not immediately clear that the method returns an

amount of

stream threads. They rather suggest that the method returns a

list

of

handles to the stream threads. I chose to use

"aliveStreamThreads"

to be

consistent with the client-level metric "alive-stream-threads"

which

reports the same number of stream threads that
numberOfAliveStreamThreads() should report. If others also
think

that

the proposed name in the KIP is too clumsy, I am open to

rename

it,

though.


Best,
Bruno


On 08.09.20 20:12, Sophie Blee-Goldman wrote:

it's never a good sign when the discussion moves into the
vote

thread


Hah, sorry, the gmail consolidation of [VOTE] and [DISCUSS]

threads

strikes

again.
Thanks for redirecting me Bruno

I suppose it's unfair to expect the callers to keep perfect

track

of

the

current
      number of stream threads, but it also seems like you

shouldn't

be

calling

removeStreamThread() when there are no threads left. Either

you're

just

haphazardly removing threads and could unintentionally slip

into a

state

of

no
running threads without realizing it, or more realistically,

you're

carefully
removing threads based on some metric(s) that convey
whether the

system

is

over or under-provisioned. If your metrics say you're

over-provisioned

but

there's
not one thread running, well, that certainly sounds
exceptional

to

me.

Or

you might
be right in that the cluster is over-provisioned but have

just

been

directing the
removeStreamThread() and addStreamThread() calls to
instances at

random,

and

end up with one massive instance and one with no threads at
all.

Again,

this
probably merits some human intervention (or system redesign)

That said, I don't think there's any real harm to just
returning

null

in

this case, but I hope
that users would pay attention to this since it seems
likely to

indicate

something has gone
seriously wrong. I suppose Optional<String> would be a

reasonable

compromise.

As for the method name, what about activeStreamThreads() or
liveStreamThreads() ?

On Mon, Sep 7, 2020 at 1:45 AM Bruno Cadonna <

br...@confluent.io>

wrote:

Hi John,

I agree with you except for checking null. I would rather

prefer

to

use

Optional<String> as the return type to both methods.

I changed the subject from [VOTE] to [DISCUSS] so that we

can

follow

up

in the discussion thread.

Best,
Bruno

On 04.09.20 23:12, John Roesler wrote:

Hi Sophie,

Uh, oh, it's never a good sign when the discussion moves
into the vote thread :)

I agree with you, it seems like a good touch for
removeStreamThread() to return the name of the thread that
got removed, rather than a boolean flag. Maybe the return
value would be `null` if there is no thread to remove.

If we go that way, I'd suggest that addStreamThread() also
return the name of the newly created thread, or null if no
thread can be created right now.

I'm not completely sure if I think that callers of this
method would know exactly how many threads there are. Sure,
if a human being is sitting there looking at the metrics or
logs and decides to call the method, it would work out, but
I'd expect this kind of method to find its way into
automated tooling that reacts to things like current system
load or resource saturation. Those kinds of toolchains

often

are part of a distributed system, and it's probably not

that

easy to guarantee that the thread count they observe is
fully consistent with the number of threads that are
actually running. Therefore, an in-situ `int
numStreamThreads()` method might not be a bad idea. Then
again, it seems sort of optional. A caller can catch an
exception or react to a `null` return value just the same
either way. Having both add/remove methods behave similarly
is probably more valuable.

Thanks,
-John


On Thu, 2020-09-03 at 12:15 -0700, Sophie Blee-Goldman
wrote:

Hey, sorry for the late reply, I just have one minor

suggestion.

Since

we

don't
make any guarantees about which thread gets removed or
allow

the

user

to

specify, I think we should return either the index or full

name

of

the

thread
that does get removed by removeThread().

I know you just updated the KIP to return true/false if
there

are/aren't any

threads to be removed, but I think this would be more

appropriate as

an

exception than as a return type. I think it's reasonable

to

expect

users to

have some sense to how many threads are remaining, and not

try

to

remove

a thread when there is none left. To me, that indicates

something

wrong

with the user application code and should be treated as an

exceptional

case.

I don't think the same code clarify argument applies
here as

to

the

addStreamThread() case, as there's no reason for an

application

to

be

looping and retrying removeStreamThread()  since if that

fails,

it's

because

there are no threads left and thus it will continue to
always

fail.

And

if

the
user actually wants to shut down all threads, they should

just

close

the

whole application rather than call removeStreamThread()
in a

loop.


While I generally think it should be straightforward for

users

to

track

how

many stream threads they have running, maybe it would be
nice

to

add

a small utility method that does this for them. Something

like


// Returns the number of currently alive threads
boolean runningStreamThreads();

On Thu, Sep 3, 2020 at 7:41 AM Matthias J. Sax <

mj...@apache.org

wrote:

+1 (binding)

On 9/3/20 6:16 AM, Bruno Cadonna wrote:

Hi,

I would like to start the voting on KIP-663 that

proposes

to

add

methods

to the Kafka Streams client to add and remove stream

threads

during

execution.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-663%3A+API+to+Start+and+Shut+Down+Stream+Threads


Best,
Bruno

Re: [DISCUSS] KIP-663: API to Start and Shut Down Stream Threads and to Request Closing of Kafka Streams Clients

Reply via email to