Re: [DISCUSS] KIP-558: Track the set of actively used topics by connectors in Kafka Connect

Konstantine Karantasis Thu, 16 Jan 2020 10:54:32 -0800

Thanks for the follow up Tom. Replies inline below this time:

On Thu, Jan 16, 2020 at 1:16 AM Tom Bentley <[email protected]> wrote:


> Hi Konstantine,
>
> b) I thought hard but briefly about this. Although there's potential for
> > ambiguity if someone inspects the key on its own, I believe that if we
> > combine this information with what we save in the value, the ambiguity is
> > removed. With respect to whether a connector can override another
> > connector's key, I don't think that's possible for any combination of
> > unique topic name and connector name. I hope I'm not missing anything.
> >
>
> I think it is possible for connectors to have conflicting keys:
>
> * topic-name="foo-connector" and connector-name="bar"
> * topic-name="foo" and connector-name="connector-bar"
>
> would both have the key "topic-foo-connector-connector-bar"
>
>
Indeed. Thanks for the pointer Tom. A collision is not as unlikely as I
initially thought. I see roughly two solutions:
a) Use a special character as a delimiter. I initially wanted to avoid
this, but given that topic names have quite restrictive rules with respect
to their allowed character set, the colon character ':' could be used to
separate the topic from the connector name. Connector names are more
flexible, but restrictions in the topic name will suffice. I suggest
keeping prefix 'topic-' and 'connector-' after the delimiter.
b) Use json to encode the key.

In order to be more consistent with the format of existing keys in the
status topic, I'm suggesting to adopt option (a) - use ':' as a separator -
at the moment. If at some point we decide to start using json for the
record keys of internal Connect topics, we should do it in a more concerted
way and in a separate KIP.



> > c) That means that worker tasks will produce topic status records once
> they
> > detect that the worker they are running on does not include a topic that
> is
> > used by the tasks in the list of active topics for this connector. But
> once
> > the worker adds this topic to the set of active topics for this
> connector,
> > then the worker tasks stop producing those messages. They'll produce one
> > again if the worker stops including that topic in the active topics set
> for
> > some reason. I could improve the wording on the KIP, but I'd also like to
> > know we are on the same page re: the design here.
> >
>
> OK, I understand now. I think it would be helpful to include the example
> you gave in your previous reply.
>

Definitely. Will add the example.



> Many thanks,
>
> Tom
>
>
I've updated the KIP to mention the delimiter and using a metric to track
active topics as a rejected alternative.
Following up with the rest of the changes shortly.

Cheers,
Konstantine



> >
> > Let me know if the above address your questions.
> > Thanks,
> > Konstantine
> >
> >
> >
> > On Wed, Jan 15, 2020 at 12:05 PM Randall Hauch <[email protected]> wrote:
> >
> > > Almog,
> > >
> > > You raise some interesting questions. Comments inline below.
> > >
> > > On Wed, Jan 15, 2020 at 11:19 AM Almog Gavra <[email protected]>
> wrote:
> > >
> > > > Hi Konstantine,
> > > >
> > > > Thanks for the KIP! This is going to make automatic integration with
> > > > Connect much more powerful.
> > > >
> > > > My thoughts are mostly around freshness of the data and being able to
> > > > expose that to users. Riffing on Randall's timestamp question - have
> we
> > > > considered adding some interval at which point a connector will
> > republish
> > > > any topics that it encounters and update the timestamp? That way we
> > have
> > > > some refreshing mechanism that isn't as powerful as the complete
> reset
> > > > (which may not be practical in many scenarios).
> > > >
> > >
> > > My question about recording the timestamp at which each active topic
> > record
> > > were (infrequently) written was more about making a bit more
> information
> > > available given the current design, and whether recording a bit more
> > > information (for very little additional storage cost and no extra
> runtime
> > > cost) may be worth it if in the future we figure out how to use this
> > > information.
> > >
> > > I think it's more complicated to try to record the history of when
> topics
> > > were most recently used, since that requires recording a lot more
> active
> > > topic records than the current proposal. Besides, it's not unexpected
> > that
> > > source and sink connectors sometimes don't use topics for periods of
> > time.
> > > A sink connector only consumes from a topic when there are additional
> > > records to consume, and a source connector only needs to write to a
> topic
> > > when there is information in the upstream system targeted to that
> topic.
> > An
> > > example of the latter is that most database connectors will write to a
> > > topic for a particular table only when rows in that table have changed.
> > >
> > >
> > > > I also agree with Randall's other point (Would it be better to not
> > > > automatically reset connector's active topics when a sink connector
> is
> > > > restarted?). I think keeping the behavior as symmetrical between sink
> > and
> > > > source connectors is a good idea.
> > > >
> > > > Lastly, with regards to the API, I can imagine it is also pretty
> useful
> > > to
> > > > answer the inverse question: "which connectors write to topic X".
> > Perhaps
> > > > we can achieve this by letting the users compute it and just expose
> an
> > > API
> > > > that returns the entire mapping at once (instead of needing to call
> the
> > > > /connectors/{name}/topics endpoint for each connector).
> > >
> > >
> > > It may be worth considering a method such as /topics that would
> produce a
> > > result that is an aggregate of the potentially many
> > > /connectors/{name}/topic responses, especially if the result of the
> > /topics
> > > is an array of the individual responses from /connectors/{name}/topic.
> > The
> > > benefit of a single request is that the answer to "which connectors use
> > > topic X" can be computed using simple tools such as 'jq'. And, if
> tooling
> > > frequently fetches the active topic information for all connectors,
> > > providing the aggregate method would reduce the load on the tooling and
> > the
> > > server. It may also be relatively easy to implement.
> > >
> > >
> > > > Otherwise, looks good to me! Hits the requirements that I had in mind
> > on
> > > > the nose.
> > > > - Almog
> > > >
> > > > On Wed, Jan 15, 2020 at 1:14 AM Tom Bentley <[email protected]>
> > wrote:
> > > >
> > > > > Hi Konstantine,
> > > > >
> > > > > Thanks for the KIP, I can see how it could be useful.
> > > > >
> > > > > a) Did you consider using a metric for this? I don't think it would
> > > > satisfy
> > > > > all the use cases you have in mind, but you could mention it in the
> > > > > rejected alternatives.
> > > > >
> > > > > b) If the topic name contains the string "-connector" then the key
> > > format
> > > > > is ambiguous. This isn't necessarily fatal because the value will
> > > > > disambiguate, but it could be misleading. Any reason not to just
> use
> > a
> > > > JSON
> > > > > key, and simplify the value?
> > > > >
> > > > > c) I didn't understand this part: "As soon as a worker detects the
> > > > addition
> > > > > of a topic to a connector's set of active topics, the worker will
> > cease
> > > > to
> > > > > post update messages to the status.storage.topic for that
> connector.
> > ".
> > > > I'm
> > > > > sure I've overlooking something but why is this necessary? Is this
> > were
> > > > the
> > > > > task id in the value is used?
> > > > >
> > > > > Thanks again,
> > > > >
> > > > > Tom
> > > > >
> > > > > On Wed, Jan 15, 2020 at 12:15 AM Randall Hauch <[email protected]>
> > > wrote:
> > > > >
> > > > > > Oh, one more thing:
> > > > > >
> > > > > > 9. There's no mention of how the status topic is partitioned, or
> > how
> > > > > > partitioning will be used by the new topic records. The KIP
> should
> > > > > probably
> > > > > > outline this for clarity and completeness.
> > > > > >
> > > > > > Best regards,
> > > > > >
> > > > > > Randall
> > > > > >
> > > > > > On Tue, Jan 14, 2020 at 5:25 PM Randall Hauch <[email protected]>
> > > > wrote:
> > > > > >
> > > > > > > Thanks, Konstantine. Overall, this KIP looks interesting and
> > really
> > > > > > > useful, and for the most part is spot on. I do have a number of
> > > > > > > questions/comments about specifics:
> > > > > > >
> > > > > > >    1. The topic records have a value that includes the
> connector
> > > > name,
> > > > > > >    task number that last reported the topic is used, and the
> > topic
> > > > > name.
> > > > > > >    There's no mention of record timestamps, but I wonder if
> it'd
> > be
> > > > > > useful to
> > > > > > >    record this. One challenge might be that a connector does
> not
> > > > write
> > > > > > to a
> > > > > > >    topic for a while or the task remains running for long
> periods
> > > of
> > > > > > time and
> > > > > > >    therefore the worker doesn't record that this topic has been
> > > newly
> > > > > > written
> > > > > > >    to since it the task was restarted. IOW, the semantics of
> the
> > > > > > timestamp may
> > > > > > >    be a bit murky. Have you thought about recording the
> > timestamp,
> > > > and
> > > > > > if so
> > > > > > >    what are the pros and cons?
> > > > > > >    - The "Recording active topics" section says the following:
> > > > > > >       "As soon as a worker detects the addition of a topic to a
> > > > > > >       connector's set of active topics, all the connector's
> tasks
> > > > that
> > > > > > inspect
> > > > > > >       source or sink records will cease to post update messages
> > to
> > > > the
> > > > > > >       status.storage.topic."
> > > > > > >       This probably means the timestamp won't be very useful.
> > > > > > >    2. The KIP says "the Kafka record value stores the ID of the
> > > task
> > > > > that
> > > > > > >    succeeded to store a topic status record last." However,
> this
> > > is a
> > > > > bit
> > > > > > >    unclear: is it really storing the last task that
> successfully
> > > > wrote
> > > > > > to that
> > > > > > >    topic (as this would require very frequent writes to this
> > > topic),
> > > > or
> > > > > > is it
> > > > > > >    more that this is the task that was last *recorded* as
> having
> > > > > written
> > > > > > >    to the topic? (Here, "recorded" could be a bit of a gray
> area,
> > > > since
> > > > > > this
> > > > > > >    would depend on the how the worker periodically records this
> > > > > > information.)
> > > > > > >    Any kind of clarity here might be helpful.
> > > > > > >    3. In the "Recording active topics" section (and the
> > surrounding
> > > > > > >    sections), the "task" is used ambiguously. For example,
> "when
> > > its
> > > > > > tasks
> > > > > > >    start processing their first records ... these tasks will
> > start
> > > > > > inspecting
> > > > > > >    which is the Kafka topic of each of these records". IIUC,
> the
> > > > first
> > > > > > "task"
> > > > > > >    mentioned is the connector's task, and the second is the
> > > worker's
> > > > > > task. Do
> > > > > > >    we need to distinguish this more clearly?
> > > > > > >    4. Maybe I missed it, but does this KIP explicitly say that
> > the
> > > > > > >    Connector API is unchanged? It's probably worth pointing out
> > to
> > > > help
> > > > > > >    assuage any concerns that connector implementations have to
> > > change
> > > > > to
> > > > > > make
> > > > > > >    use of this feature.
> > > > > > >    5. In the "Resetting a connector's set of active topics"
> > section
> > > > the
> > > > > > >    behavior is not exactly clear. Consider a user running
> > connector
> > > > > "A",
> > > > > > the
> > > > > > >    connector has been fully started and is processing records,
> > and
> > > > the
> > > > > > worker
> > > > > > >    has recorded topic usage records. Then the user resets the
> > > active
> > > > > > topics
> > > > > > >    for connector A while the connector is still running? If the
> > > > > connector
> > > > > > >    writes to no new topics, before the tasks are rebalanced
> then
> > is
> > > > it
> > > > > > correct
> > > > > > >    that Connect would report no active topics? And after the
> > tasks
> > > > are
> > > > > > >    rebalance, will the worker record any topics used by
> connector
> > > A?
> > > > > > >    6. In the "Restaring" (misspelled) section: "Reconfiguring a
> > > > source
> > > > > > >    connector has also no altering effect for a source
> connector.
> > > > > > However, when
> > > > > > >    reconfiguring a sink connector if the new configuration no
> > > longer
> > > > > > includes
> > > > > > >    any of the previously tracked topics, these topics will be
> > > removed
> > > > > > from the
> > > > > > >    set of active topics for this sink connector by appending
> > > > tombstone
> > > > > > >    messages appropriately after the reconfiguration of the
> > > > connector."
> > > > > > Would
> > > > > > >    it be better to not automatically reset connector's active
> > > topics
> > > > > > when a
> > > > > > >    sink connector is restarted? Isn't that more consistent with
> > the
> > > > > > >    "Resetting" behavior and the goals at the top of the KIP:
> > "it'd
> > > be
> > > > > > useful
> > > > > > >    for users, operators and applications to know which are the
> > > topics
> > > > > > that a
> > > > > > >    connector has used since it was first created"?
> > > > > > >    7. The `PUT /connectors/{name}/topics/reset` endpoint "this
> > > > request
> > > > > > >    can be reapplied after the deletion of the connector". IOW,
> > even
> > > > > > though
> > > > > > >    connector with that name doesn't exist, we can still make
> this
> > > > > > request? How
> > > > > > >    does this compare with other methods such as "status"?
> > > > > > >    8. What are the security implications of this proposal?
> > > > > > >
> > > > > > > As you can see, most of these can probably be addressed without
> > > much
> > > > > > work.
> > > > > > >
> > > > > > > Best regards,
> > > > > > >
> > > > > > > Randall
> > > > > > >
> > > > > > > On Mon, Jan 13, 2020 at 11:05 PM Konstantine Karantasis <
> > > > > > > [email protected]> wrote:
> > > > > > >
> > > > > > >> Hi all.
> > > > > > >>
> > > > > > >> I just posted KIP-558: Track the set of actively used topics
> by
> > > > > > connectors
> > > > > > >> in Kafka Connect
> > > > > > >>
> > > > > > >> Wiki link here:
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-558%3A+Track+the+set+of+actively+used+topics+by+connectors+in+Kafka+Connect
> > > > > > >>
> > > > > > >> I think it's a nice extension to follow up on KIP-158 and a
> > useful
> > > > > > feature
> > > > > > >> to the ever increasing number of applications that are built
> > > around
> > > > > > Kafka
> > > > > > >> Connect.
> > > > > > >> Would love to hear what you think.
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Konstantine
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-558: Track the set of actively used topics by connectors in Kafka Connect

Reply via email to