Hi Konstantine,

Thanks for the KIP! This is going to make automatic integration with
Connect much more powerful.

My thoughts are mostly around freshness of the data and being able to
expose that to users. Riffing on Randall's timestamp question - have we
considered adding some interval at which point a connector will republish
any topics that it encounters and update the timestamp? That way we have
some refreshing mechanism that isn't as powerful as the complete reset
(which may not be practical in many scenarios).

I also agree with Randall's other point (Would it be better to not
automatically reset connector's active topics when a sink connector is
restarted?). I think keeping the behavior as symmetrical between sink and
source connectors is a good idea.

Lastly, with regards to the API, I can imagine it is also pretty useful to
answer the inverse question: "which connectors write to topic X". Perhaps
we can achieve this by letting the users compute it and just expose an API
that returns the entire mapping at once (instead of needing to call the
/connectors/{name}/topics endpoint for each connector).

Otherwise, looks good to me! Hits the requirements that I had in mind on
the nose.
- Almog

On Wed, Jan 15, 2020 at 1:14 AM Tom Bentley <tbent...@redhat.com> wrote:

> Hi Konstantine,
>
> Thanks for the KIP, I can see how it could be useful.
>
> a) Did you consider using a metric for this? I don't think it would satisfy
> all the use cases you have in mind, but you could mention it in the
> rejected alternatives.
>
> b) If the topic name contains the string "-connector" then the key format
> is ambiguous. This isn't necessarily fatal because the value will
> disambiguate, but it could be misleading. Any reason not to just use a JSON
> key, and simplify the value?
>
> c) I didn't understand this part: "As soon as a worker detects the addition
> of a topic to a connector's set of active topics, the worker will cease to
> post update messages to the status.storage.topic for that connector. ". I'm
> sure I've overlooking something but why is this necessary? Is this were the
> task id in the value is used?
>
> Thanks again,
>
> Tom
>
> On Wed, Jan 15, 2020 at 12:15 AM Randall Hauch <rha...@gmail.com> wrote:
>
> > Oh, one more thing:
> >
> > 9. There's no mention of how the status topic is partitioned, or how
> > partitioning will be used by the new topic records. The KIP should
> probably
> > outline this for clarity and completeness.
> >
> > Best regards,
> >
> > Randall
> >
> > On Tue, Jan 14, 2020 at 5:25 PM Randall Hauch <rha...@gmail.com> wrote:
> >
> > > Thanks, Konstantine. Overall, this KIP looks interesting and really
> > > useful, and for the most part is spot on. I do have a number of
> > > questions/comments about specifics:
> > >
> > >    1. The topic records have a value that includes the connector name,
> > >    task number that last reported the topic is used, and the topic
> name.
> > >    There's no mention of record timestamps, but I wonder if it'd be
> > useful to
> > >    record this. One challenge might be that a connector does not write
> > to a
> > >    topic for a while or the task remains running for long periods of
> > time and
> > >    therefore the worker doesn't record that this topic has been newly
> > written
> > >    to since it the task was restarted. IOW, the semantics of the
> > timestamp may
> > >    be a bit murky. Have you thought about recording the timestamp, and
> > if so
> > >    what are the pros and cons?
> > >    - The "Recording active topics" section says the following:
> > >       "As soon as a worker detects the addition of a topic to a
> > >       connector's set of active topics, all the connector's tasks that
> > inspect
> > >       source or sink records will cease to post update messages to the
> > >       status.storage.topic."
> > >       This probably means the timestamp won't be very useful.
> > >    2. The KIP says "the Kafka record value stores the ID of the task
> that
> > >    succeeded to store a topic status record last." However, this is a
> bit
> > >    unclear: is it really storing the last task that successfully wrote
> > to that
> > >    topic (as this would require very frequent writes to this topic), or
> > is it
> > >    more that this is the task that was last *recorded* as having
> written
> > >    to the topic? (Here, "recorded" could be a bit of a gray area, since
> > this
> > >    would depend on the how the worker periodically records this
> > information.)
> > >    Any kind of clarity here might be helpful.
> > >    3. In the "Recording active topics" section (and the surrounding
> > >    sections), the "task" is used ambiguously. For example, "when its
> > tasks
> > >    start processing their first records ... these tasks will start
> > inspecting
> > >    which is the Kafka topic of each of these records". IIUC, the first
> > "task"
> > >    mentioned is the connector's task, and the second is the worker's
> > task. Do
> > >    we need to distinguish this more clearly?
> > >    4. Maybe I missed it, but does this KIP explicitly say that the
> > >    Connector API is unchanged? It's probably worth pointing out to help
> > >    assuage any concerns that connector implementations have to change
> to
> > make
> > >    use of this feature.
> > >    5. In the "Resetting a connector's set of active topics" section the
> > >    behavior is not exactly clear. Consider a user running connector
> "A",
> > the
> > >    connector has been fully started and is processing records, and the
> > worker
> > >    has recorded topic usage records. Then the user resets the active
> > topics
> > >    for connector A while the connector is still running? If the
> connector
> > >    writes to no new topics, before the tasks are rebalanced then is it
> > correct
> > >    that Connect would report no active topics? And after the tasks are
> > >    rebalance, will the worker record any topics used by connector A?
> > >    6. In the "Restaring" (misspelled) section: "Reconfiguring a source
> > >    connector has also no altering effect for a source connector.
> > However, when
> > >    reconfiguring a sink connector if the new configuration no longer
> > includes
> > >    any of the previously tracked topics, these topics will be removed
> > from the
> > >    set of active topics for this sink connector by appending tombstone
> > >    messages appropriately after the reconfiguration of the connector."
> > Would
> > >    it be better to not automatically reset connector's active topics
> > when a
> > >    sink connector is restarted? Isn't that more consistent with the
> > >    "Resetting" behavior and the goals at the top of the KIP: "it'd be
> > useful
> > >    for users, operators and applications to know which are the topics
> > that a
> > >    connector has used since it was first created"?
> > >    7. The `PUT /connectors/{name}/topics/reset` endpoint "this request
> > >    can be reapplied after the deletion of the connector". IOW, even
> > though
> > >    connector with that name doesn't exist, we can still make this
> > request? How
> > >    does this compare with other methods such as "status"?
> > >    8. What are the security implications of this proposal?
> > >
> > > As you can see, most of these can probably be addressed without much
> > work.
> > >
> > > Best regards,
> > >
> > > Randall
> > >
> > > On Mon, Jan 13, 2020 at 11:05 PM Konstantine Karantasis <
> > > konstant...@confluent.io> wrote:
> > >
> > >> Hi all.
> > >>
> > >> I just posted KIP-558: Track the set of actively used topics by
> > connectors
> > >> in Kafka Connect
> > >>
> > >> Wiki link here:
> > >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-558%3A+Track+the+set+of+actively+used+topics+by+connectors+in+Kafka+Connect
> > >>
> > >> I think it's a nice extension to follow up on KIP-158 and a useful
> > feature
> > >> to the ever increasing number of applications that are built around
> > Kafka
> > >> Connect.
> > >> Would love to hear what you think.
> > >>
> > >> Best,
> > >> Konstantine
> > >>
> > >
> >
>

Reply via email to