[
https://issues.apache.org/jira/browse/KAFKA-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407352#comment-16407352
]
Ewen Cheslack-Postava commented on KAFKA-3821:
----------------------------------------------
So a couple of points --
* That new SourceRecordReceiver interface doesn't necessarily avoid
allocations, it likely just moves them around. The framework would still want
to track the collection generated from a single poll() call because in the case
of a connector throwing an error mid-poll(), we'd want to not actually write
any of the messages generated thus far. So we basically just move the list into
SourceRecordReceiver. In fact, doing so could make things worse because whereas
a Connector may be able to figure out how large the list needs to be and do a
single allocation, the framework has no idea and would probably do multiple
rounds of expanding the list.
* In any case, I really think that one allocation isn't worth worrying about
unless someone has profiling to show it. We allocate so many objects just for a
single SourceRecord, *especially* in pretty common cases of complex object
structure, that this doesn't seem worth optimizing.
* I think we should focus on optimizing for the common case, which is that
there will be multiple messages. When that's not the case, the performance
impact seems unimportant since it would probably mean you have relatively
infrequent events.
* Adding more interfaces adds to cognitive load and makes it harder to learn
how to write connectors. Context objects already provide a place to request the
framework do things outside the common workflow, so it seems like a natural
place to add this functionality if we decided to. Same deal for the EOS stuff,
which could potentially just be transaction APIs in the Context.
* Just stylistically, Kafka's public APIs tend to try to keep things simple
and straightforward (for some definition of those words that I honestly am not
sure I could give a clear explanation of).
I don't want to discourage the discussion of how to solve this problem for
Debezium's use case, but I do want to make sure we're taking into account
broader goals for the framework when figuring out how to solve it (e.g.
SourceRecordReceiver may work, but I would argue returning a list of records
makes things easier since it's obvious from one line of code how to implement
it).
It might also help to explain better why the general idea here just rubs me the
wrong way. Mostly it boils down to sort of breaking the abstraction. The way
offsets are handled for source connectors was supposed to mirror Kafka's log
abstraction such that an offset really is unique. I guess maybe a gap in
understanding for me is why offsets *wouldn't* be unique during snapshots and
how rewind works for these snapshots. I get that not all systems or use cases
can map perfectly, but a) would another option be to change how those offsets
are handled by Debezium, b) would another option be having Debezium read 1
record forward to determine before returning the record and/or constructing its
offset whether this is the final record of the snapshot, and c) do we have
other use cases the demonstrate an impedance mismatch that would be well
answered by solutions being proposed here?
> Allow Kafka Connect source tasks to produce offset without writing to topics
> ----------------------------------------------------------------------------
>
> Key: KAFKA-3821
> URL: https://issues.apache.org/jira/browse/KAFKA-3821
> Project: Kafka
> Issue Type: Improvement
> Components: KafkaConnect
> Affects Versions: 0.9.0.1
> Reporter: Randall Hauch
> Priority: Major
> Labels: needs-kip
> Fix For: 1.2.0
>
>
> Provide a way for a {{SourceTask}} implementation to record a new offset for
> a given partition without necessarily writing a source record to a topic.
> Consider a connector task that uses the same offset when producing an unknown
> number of {{SourceRecord}} objects (e.g., it is taking a snapshot of a
> database). Once the task completes those records, the connector wants to
> update the offsets (e.g., the snapshot is complete) but has no more records
> to be written to a topic. With this change, the task could simply supply an
> updated offset.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)