I'm moving the discussion on KIP-740 to an actual [DISCUSS] thread since it
turned out to be more nuanced
and grew in scope since I initially proposed it. Thanks John for
the suggestions/questions on the [VOTE] thread,
I have copied them over and addressed them below.

Before that, I just want to apologize to everyone for bringing this to a
vote immediately. I truly thought this would
be a trivial KIP that just fixed what seemed like a bug in the API, but it
slowly grew in scope as I wrangled with
the question of whether TaskId was already a public API. To clarify, it is
-- a TaskId is returned by the
Processor/StateStoreContext.taskId() methods, I just hadn't noticed those
APIs at the time of writing this KIP.
After finding those APIs and confirming that it was public, I began to
wonder  whether it really should be.

1. "Topic Group" is a vestigial term from the early days of
> the codebase. We call a "topic group" a "subtopology" in the
> public interface now (although "topic group" is still used
> internally some places). For user-facing consistency, we
> should also use "subtopologyId" in your proposal.


100% agree, I remember being very confused by the term "topic group"  when
I was first getting to know
the Streams code base, and wondering why it wasn't just called
"subtopology". This would be the perfect
opportunity to migrate to the more clear terminology where appropriate.
I've updated the KIP to reflect this.

2. I'm wondering if it's really necessary to introduce this
> interface at all. I think your motivation is to be able to
> get the subtopologyId and partition via TaskMetadata, right?
> Why not just add those methods to TaskMetadata? Stepping
> back, the concept of metadata about an identifier is a bit
> elaborate.


I think if the only issue we wanted to address was the TaskMetadata
returning a string representation of a
TaskId rather than the real thing, as was originally the case, then I would
agree with just  unfolding the API
to expose the TaskId fields directly. However while digging into the
question of whether TaskId was public, it
became clear that the class had some awkward features for a public API:
public fields with no getters, many
utility methods for internal functionality that had to be made public and
cluttered up the API, etc. So I believe
this KIP can and should take the opportunity to clean up the TaskId API
altogether, not just for one getter in
TaskMetadata. And we can't deprecate an API without adding a replacement
for users to migrate onto, since
the concept of a TaskId itself is still very much relevant. That's the
primary motivation behind introducing this
new interface. Of course technically this could be done by retaining the
existing TaskId class and just moving
the things that should not be public to the new internal class, but that's
just a worse version of the same idea,
and would require a somewhat awkward deprecation cycle. Plus, given that
TaskId is intended to just be a
wrapper for some misc information about a Task, a public interface feels
more natural than a class.

It also gives us more flexibility in the future, as it's much easier to
just add a field to the TaskId (with a new
getter on the interface) then to track  down all places where we return the
TaskId fields, ie every topicGroupId()
and partition() getters on an API, and add a new getter API for this field.

Please refer to the KIP for the current proposal, including all deprecated
and new APIs. I hope the latest
iteration will make everyone happy. The only potentially controversial
question remaining is how to name
the new interface. Here are all the options that seem reasonable to me,
though chose to go with the 2nd one:

1) Keep the name TaskId for the interface, put the interface in the
o.a.k.streams package, and call the internal class
    TaskIdImpl. This will require a fully qualified class in any file that
references both the new interface and the
    deprecated public TaskId class.
2) Same as the above, but allow the internal class to also retain the name
TaskId. This will also require a fully qualified
    class name whenever any of the three are referenced in the same file,
though I don't think there is likely to be a single
    file where the new internal TaskId class appears along with the old
public TaskId class but not the new TaskId interface,
    or vice versa, so the impact is probably exactly the same as the above
option.
3) Name the interface something else entirely (eg TaskInfo) and keep the
name TaskId for the new internal class. The
    new internal TaskId and the deprecated public TaskId class most likely
appear together. This option is probably the
    least annoying with respect to requiring fully qualified class names,
but I expect it's still on par with how often this
    would be required in option 1 or 2.

Given that both the new public interface and the deprecated public TaskId
class are almost always going to appear together,
we can't avoid fully qualified path names until the deprecated class can be
removed, unless we give the interface a different
name. And most of the time only the internal TaskId class will be
referenced anyways, so the scope of fully qualified names
should still be small with any option. In fact the vast majority of the
codebase will be using the new internal class, so leaving it
as "TaskId" will help prevent a truly massive PR, as only the import line
will need to be changed, vs hundreds/thousands of lines
in which the class name of the TaskId object appears. So I would rather
avoid option 1.

With option 2 we'll need to suppress a spotbugs warning, while with option
3 we are changing the name of a pretty fundamental
concept throughout Streams, so users will need to mentally adapt to
thinking about TaskInfo (or whatever) instead of in terms
of TaskId. That is why I personally prefer option 2.


Thanks for the discussion, sorry this turned into such a long response. I
think we're on the same page now but I want to make
sure the background and my general thought process behind all this was
clear to all. I'll kick off the vote again but please respond
here with any further concerns and save the [VOTE] thread for voting and
maybe the occasional small question.

-Sophie

Reply via email to