[ https://issues.apache.org/jira/browse/KAFKA-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407352 ]

Ewen Cheslack-Postava commented on KAFKA-3821:
----------------------------------------------

So a couple of points --
 * That new SourceRecordReceiver interface doesn't necessarily avoid 
allocations, it likely just moves them around. The framework would still need 
to track the collection of records generated by a single poll() call, because 
if a connector throws an error mid-poll() we don't want to write any of the 
messages generated so far. So we'd basically just move the list behind 
SourceRecordReceiver (rough sketch of this after the list below). In fact, 
doing so could make things worse: a connector may be able to figure out how 
large the list needs to be and do a single allocation, whereas the framework 
has no idea and would probably go through multiple rounds of expanding the 
list.
 * In any case, I really think that one allocation isn't worth worrying about 
unless someone has profiling to show it. We allocate so many objects just for a 
single SourceRecord, *especially* in pretty common cases of complex object 
structure, that this doesn't seem worth optimizing.
 * I think we should focus on optimizing for the common case, which is that 
there will be multiple messages. When that's not the case, the performance 
impact seems unimportant since it would probably mean you have relatively 
infrequent events.
 * Adding more interfaces adds to cognitive load and makes it harder to learn 
how to write connectors. Context objects already provide a place to request 
that the framework do things outside the common workflow, so that seems like a 
natural place to add this functionality if we decide to (second sketch below). 
Same deal for the EOS stuff, which could potentially just be transaction APIs 
on the Context.
 * Just stylistically, Kafka's public APIs tend to try to keep things simple 
and straightforward (for some definition of those words that I honestly am not 
sure I could give a clear explanation of).
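
To make the first bullet concrete, here's a rough sketch of where the list ends 
up with a callback-style API. SourceRecordReceiver and PollWithReceiver are 
just placeholders for the proposal; nothing here is existing Kafka code except 
SourceRecord itself:

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.kafka.connect.source.SourceRecord;

public class ReceiverAllocationSketch {

    /** Hypothetical callback-style interface from the proposal. */
    interface SourceRecordReceiver {
        void accept(SourceRecord record);
    }

    /** Hypothetical poll() signature under the proposal. */
    interface PollWithReceiver {
        void poll(SourceRecordReceiver receiver) throws InterruptedException;
    }

    /**
     * The framework side would still have to buffer everything a task emits
     * during one poll() so that nothing gets written if the task throws
     * mid-poll. The list allocation just moves from the connector to here,
     * and the framework can't pre-size it.
     */
    static List<SourceRecord> collectOnePoll(PollWithReceiver task) throws InterruptedException {
        List<SourceRecord> buffered = new ArrayList<>(); // unknown size, may be resized repeatedly
        task.poll(buffered::add);                        // only handed off if poll() completes
        return buffered;
    }

    /**
     * By contrast, with the current contract (poll() returns a list), a
     * connector that knows its batch size can do a single, exact-size
     * allocation itself.
     */
    static List<SourceRecord> pollReturningList(SourceRecord[] pending) {
        List<SourceRecord> records = new ArrayList<>(pending.length); // one allocation
        for (SourceRecord r : pending) {
            records.add(r);
        }
        return records;
    }
}
{code}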
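
And here's a sketch of what the Context-based version of the original request 
might look like. To be clear, OffsetRecordingContext and recordOffset() don't 
exist in Kafka today; the point is only that the task-facing surface stays 
small:

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public abstract class SnapshotTaskSketch extends SourceTask {

    /** Hypothetical extension of the task context; not part of Kafka today. */
    interface OffsetRecordingContext {
        /** Ask the framework to commit this offset even though no record carries it. */
        void recordOffset(Map<String, ?> sourcePartition, Map<String, ?> sourceOffset);
    }

    private OffsetRecordingContext extendedContext; // would be provided by the framework

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        if (snapshotJustFinished()) {
            // The snapshot is done but there's nothing left to write; tell the
            // framework the new offset instead of emitting a dummy record.
            extendedContext.recordOffset(
                    Collections.singletonMap("server", "db1"),
                    Collections.singletonMap("snapshot_complete", true));
            return Collections.emptyList();
        }
        return nextBatch();
    }

    // Connector-specific details elided.
    protected abstract boolean snapshotJustFinished();
    protected abstract List<SourceRecord> nextBatch() throws InterruptedException;
}
{code}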

I don't want to discourage the discussion of how to solve this problem for 
Debezium's use case, but I do want to make sure we're taking into account 
broader goals for the framework when figuring out how to solve it (e.g. 
SourceRecordReceiver may work, but I would argue returning a list of records 
makes things easier since it's obvious from one line of code how to implement 
it).

It might also help to explain better why the general idea here just rubs me the 
wrong way. Mostly it boils down to sort of breaking the abstraction. The way 
offsets are handled for source connectors was supposed to mirror Kafka's log 
abstraction such that an offset really is unique. I guess maybe a gap in 
understanding for me is why offsets *wouldn't* be unique during snapshots and 
how rewind works for these snapshots. I get that not all systems or use cases 
can map perfectly, but a) would another option be to change how those offsets 
are handled by Debezium, b) would another option be to have Debezium read one 
record ahead, so that before returning a record and constructing its offset it 
already knows whether this is the final record of the snapshot (rough sketch 
of that below), and c) do we have other use cases that demonstrate an 
impedance mismatch that would be well answered by the solutions being proposed 
here?
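
For (b), I'm imagining something as simple as a one-record look-ahead buffer, 
so the offset attached to each record can already say whether the snapshot is 
complete. Types and names here are made up, not Debezium's actual reader:

{code:java}
import java.util.Iterator;

public class SnapshotLookaheadSketch<T> {

    /** The record just returned, plus whether it was the last one in the snapshot. */
    public static final class Emitted<R> {
        public final R record;
        public final boolean lastInSnapshot;

        Emitted(R record, boolean lastInSnapshot) {
            this.record = record;
            this.lastInSnapshot = lastInSnapshot;
        }
    }

    private final Iterator<T> source;
    private T next; // one-record look-ahead buffer

    public SnapshotLookaheadSketch(Iterator<T> source) {
        this.source = source;
        this.next = source.hasNext() ? source.next() : null;
    }

    /** Returns the next record, already knowing whether it's the final one. */
    public Emitted<T> poll() {
        if (next == null) {
            return null; // snapshot exhausted
        }
        T current = next;
        next = source.hasNext() ? source.next() : null;
        boolean lastInSnapshot = (next == null);
        // The caller can now build the source offset for `current` with, say,
        // snapshot_complete=lastInSnapshot, so the offset is final when the
        // record is returned and never needs to be rewritten afterwards.
        return new Emitted<>(current, lastInSnapshot);
    }
}
{code}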

> Allow Kafka Connect source tasks to produce offset without writing to topics
> ----------------------------------------------------------------------------
>
>                 Key: KAFKA-3821
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3821
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>    Affects Versions: 0.9.0.1
>            Reporter: Randall Hauch
>            Priority: Major
>              Labels: needs-kip
>             Fix For: 1.2.0
>
>
> Provide a way for a {{SourceTask}} implementation to record a new offset for 
> a given partition without necessarily writing a source record to a topic.
> Consider a connector task that uses the same offset when producing an unknown 
> number of {{SourceRecord}} objects (e.g., it is taking a snapshot of a 
> database). Once the task completes those records, the connector wants to 
> update the offsets (e.g., the snapshot is complete) but has no more records 
> to be written to a topic. With this change, the task could simply supply an 
> updated offset.


