Another update:

After going through all the options currently on the table, I've decided to
go back to an earlier version of the
proposal, in which we just fix the TaskMetadata API to return a TaskId
object rather than its String representation,
and just clean up the aspects of the TaskId class which are not suitable
for a public API. The full details are in
the Motivation and Rejected Alternatives sections of the KIP.

KIP-740: Clean up public API in TaskId and fix TaskMetadata#taskId()
<https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181306557>

Also, in a previous message I asserted that changing from a class to an
abstract class or interface was not a
compatible change. However we've relaxed the compatibility contract in
Streams a bit to guarantee source
compatibility rather than binary compatibility, in which case this kind of
change does in fact satisfy our policy.
However this was ultimately rejected for other reasons, see the KIP.

Thanks all,
Sophie

On Sun, May 16, 2021 at 11:40 PM Sophie Blee-Goldman <sop...@confluent.io>
wrote:

> Changing `TaskId` to an abstract class and deprecate all util functions /
>> fields that we do not
>> want to expose any longer
>
>
> I had considered doing exactly this, where we just convert the existing
> public TaskId class to an interface
> or an abstract class rather than introducing a new one, but it turns out
> that modifying a class in this way is
> not backwards compatible (see Evolving API packages
> <https://wiki.eclipse.org/Evolving_Java-based_APIs_2#:~:text=(1)%20API%20class-interface,method%20declared%20in%20a%20class.>
>  and Evolving API Classes
> <https://wiki.eclipse.org/Evolving_Java-based_APIs_2#Evolving_API_classes_-_API_methods_and_constructors>).
> So it's off the table.
>
> However, that still leaves the door open for reclaiming the existing
> TaskId class and just deprecating the
> inappropriate public APIs, then extending this with a subclass. But this
> just does not seem like an appropriate
> application of subclassing (to me), as that hierarchy structure is exactly
> what interfaces are for. I tried to touch
> on this in my last reply but admittedly did so only briefly, and rather
> dismissively. I would be happy to consider
> this option as well, though as is probably clear by now, it's not my
> personal preference.
>
> Thoughts?
>
> On Sun, May 16, 2021, 9:17 PM Guozhang Wang <wangg...@gmail.com> wrote:
>
>> Sophie,
>>
>> Thanks for the nice summary. Originally I'm leaning towards option 2)
>> since
>> changing this name to others could be a pretty large conceptual change to
>> users, and I think the concerns you raised is good as well that we cannot
>> reuse the `taskId()` function names any more..
>>
>> After pondering on it a bit, what about 1) keeping `TaskId` in the current
>> package as that's not our main concern anyways, 2) changing `TaskId` to an
>> abstract class and deprecate all util functions / fields that we do not
>> want to expose any longer, and 3) introducing a new class that extends the
>> o.a.k.streams.processor.TaskId for internal implementations?
>>
>>
>> Guozhang
>>
>>
>> On Sun, May 16, 2021 at 7:19 PM Sophie Blee-Goldman
>> <sop...@confluent.io.invalid> wrote:
>>
>> > Sorry, one correction regarding the discussion on naming below: the
>> other
>> > con for reusing TaskId as the interface name is that we would not be
>> able
>> > to follow the standard getter naming conventions and introduce new
>> taskId()
>> > APIs to replace the deprecated ones, as the deprecated methods have
>> already
>> > claimed that name. So we would have to come up with a different name for
>> > these APIs, for example taskInfo()  would require the TaskId interface
>> > which is a bit confusing. Or using getTaskId() temporarily until the
>> > deprecated methods can be removed, which goes against the established
>> > convention of no "get" in the getters. Another possibility is something
>> > like "taskIdInfo()" or "taskIdMetadata()" so that it's still mentioning
>> > taskId somewhere in the name.
>> >
>> > Thoughts? Given this awkwardness I'm somewhat leaning towards renaming
>> the
>> > interface after all, as in the example above to "TaskInfo". Or we could
>> > take a similar approach and just build on the original TaskId name,
>> calling
>> > it TaskIdInfo or something like that. Any thoughts? As always the KIP
>> has
>> > been updated with what I am proposing, so you can take a look yourself.
>> >
>> > On Sun, May 16, 2021 at 7:07 PM Sophie Blee-Goldman <
>> sop...@confluent.io>
>> > wrote:
>> >
>> > > I'm moving the discussion on KIP-740 to an actual [DISCUSS] thread
>> since
>> > > it turned out to be more nuanced
>> > > and grew in scope since I initially proposed it. Thanks John for
>> > > the suggestions/questions on the [VOTE] thread,
>> > > I have copied them over and addressed them below.
>> > >
>> > > Before that, I just want to apologize to everyone for bringing this
>> to a
>> > > vote immediately. I truly thought this would
>> > > be a trivial KIP that just fixed what seemed like a bug in the API,
>> but
>> > it
>> > > slowly grew in scope as I wrangled with
>> > > the question of whether TaskId was already a public API. To clarify,
>> it
>> > is
>> > > -- a TaskId is returned by the
>> > > Processor/StateStoreContext.taskId() methods, I just hadn't noticed
>> those
>> > > APIs at the time of writing this KIP.
>> > > After finding those APIs and confirming that it was public, I began to
>> > > wonder  whether it really should be.
>> > >
>> > > 1. "Topic Group" is a vestigial term from the early days of
>> > >> the codebase. We call a "topic group" a "subtopology" in the
>> > >> public interface now (although "topic group" is still used
>> > >> internally some places). For user-facing consistency, we
>> > >> should also use "subtopologyId" in your proposal.
>> > >
>> > >
>> > > 100% agree, I remember being very confused by the term "topic group"
>> > when
>> > > I was first getting to know
>> > > the Streams code base, and wondering why it wasn't just called
>> > > "subtopology". This would be the perfect
>> > > opportunity to migrate to the more clear terminology where
>> appropriate.
>> > > I've updated the KIP to reflect this.
>> > >
>> > > 2. I'm wondering if it's really necessary to introduce this
>> > >> interface at all. I think your motivation is to be able to
>> > >> get the subtopologyId and partition via TaskMetadata, right?
>> > >> Why not just add those methods to TaskMetadata? Stepping
>> > >> back, the concept of metadata about an identifier is a bit
>> > >> elaborate.
>> > >
>> > >
>> > > I think if the only issue we wanted to address was the TaskMetadata
>> > > returning a string representation of a
>> > > TaskId rather than the real thing, as was originally the case, then I
>> > > would agree with just  unfolding the API
>> > > to expose the TaskId fields directly. However while digging into the
>> > > question of whether TaskId was public, it
>> > > became clear that the class had some awkward features for a public
>> API:
>> > > public fields with no getters, many
>> > > utility methods for internal functionality that had to be made public
>> and
>> > > cluttered up the API, etc. So I believe
>> > > this KIP can and should take the opportunity to clean up the TaskId
>> API
>> > > altogether, not just for one getter in
>> > > TaskMetadata. And we can't deprecate an API without adding a
>> replacement
>> > > for users to migrate onto, since
>> > > the concept of a TaskId itself is still very much relevant. That's the
>> > > primary motivation behind introducing this
>> > > new interface. Of course technically this could be done by retaining
>> the
>> > > existing TaskId class and just moving
>> > > the things that should not be public to the new internal class, but
>> > that's
>> > > just a worse version of the same idea,
>> > > and would require a somewhat awkward deprecation cycle. Plus, given
>> that
>> > > TaskId is intended to just be a
>> > > wrapper for some misc information about a Task, a public interface
>> feels
>> > > more natural than a class.
>> > >
>> > > It also gives us more flexibility in the future, as it's much easier
>> to
>> > > just add a field to the TaskId (with a new
>> > > getter on the interface) then to track  down all places where we
>> return
>> > > the TaskId fields, ie every topicGroupId()
>> > > and partition() getters on an API, and add a new getter API for this
>> > field.
>> > >
>> > > Please refer to the KIP for the current proposal, including all
>> > deprecated
>> > > and new APIs. I hope the latest
>> > > iteration will make everyone happy. The only potentially controversial
>> > > question remaining is how to name
>> > > the new interface. Here are all the options that seem reasonable to
>> me,
>> > > though chose to go with the 2nd one:
>> > >
>> > > 1) Keep the name TaskId for the interface, put the interface in the
>> > > o.a.k.streams package, and call the internal class
>> > >     TaskIdImpl. This will require a fully qualified class in any file
>> > that
>> > > references both the new interface and the
>> > >     deprecated public TaskId class.
>> > > 2) Same as the above, but allow the internal class to also retain the
>> > name
>> > > TaskId. This will also require a fully qualified
>> > >     class name whenever any of the three are referenced in the same
>> file,
>> > > though I don't think there is likely to be a single
>> > >     file where the new internal TaskId class appears along with the
>> old
>> > > public TaskId class but not the new TaskId interface,
>> > >     or vice versa, so the impact is probably exactly the same as the
>> > above
>> > > option.
>> > > 3) Name the interface something else entirely (eg TaskInfo) and keep
>> the
>> > > name TaskId for the new internal class. The
>> > >     new internal TaskId and the deprecated public TaskId class most
>> > likely
>> > > appear together. This option is probably the
>> > >     least annoying with respect to requiring fully qualified class
>> names,
>> > > but I expect it's still on par with how often this
>> > >     would be required in option 1 or 2.
>> > >
>> > > Given that both the new public interface and the deprecated public
>> TaskId
>> > > class are almost always going to appear together,
>> > > we can't avoid fully qualified path names until the deprecated class
>> can
>> > > be removed, unless we give the interface a different
>> > > name. And most of the time only the internal TaskId class will be
>> > > referenced anyways, so the scope of fully qualified names
>> > > should still be small with any option. In fact the vast majority of
>> the
>> > > codebase will be using the new internal class, so leaving it
>> > > as "TaskId" will help prevent a truly massive PR, as only the import
>> line
>> > > will need to be changed, vs hundreds/thousands of lines
>> > > in which the class name of the TaskId object appears. So I would
>> rather
>> > > avoid option 1.
>> > >
>> > > With option 2 we'll need to suppress a spotbugs warning, while with
>> > option
>> > > 3 we are changing the name of a pretty fundamental
>> > > concept throughout Streams, so users will need to mentally adapt to
>> > > thinking about TaskInfo (or whatever) instead of in terms
>> > > of TaskId. That is why I personally prefer option 2.
>> > >
>> > >
>> > > Thanks for the discussion, sorry this turned into such a long
>> response. I
>> > > think we're on the same page now but I want to make
>> > > sure the background and my general thought process behind all this was
>> > > clear to all. I'll kick off the vote again but please respond
>> > > here with any further concerns and save the [VOTE] thread for voting
>> and
>> > > maybe the occasional small question.
>> > >
>> > > -Sophie
>> > >
>> >
>>
>>
>> --
>> -- Guozhang
>>
>

Reply via email to