[
https://issues.apache.org/jira/browse/KAFKA-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alieh Saeedi updated KAFKA-20179:
---------------------------------
Description:
When writing to the changelog topic, we currently need to deserialize the
record headers, which is not ideal from a performance perspective. We should
look for a way to avoid this overhead.
One option, as suggested by [~mjsax], would be to reuse the header bytes from
the Kafka message format directly, without deserialization. However, this would
couple Kafka Streams tightly to the internal serialization format of headers in
the Kafka record. Although we already use the same format today, we are not yet
_formally_ coupled to it. If the Kafka message format for headers ever changes,
this approach would directly impact Kafka Streams and raise compatibility
concerns (ref:
[https://github.com/apache/kafka/pull/21345/changes#r2761685550]).
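For reference, the record-format v2 header layout that Kafka Streams would be coupling to is: a varint header count, then for each header a varint key length, the UTF-8 key bytes, a varint value length (-1 for a null value), and the value bytes, with all varints zigzag-encoded. A self-contained sketch of that layout (the helpers below are illustrative and mirror, rather than call, Kafka's internal utilities):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class HeaderWireFormat {

    // Zigzag-encoded varint, as used for all varint fields in record format v2.
    static void writeVarint(int value, ByteArrayOutputStream out) {
        int v = (value << 1) ^ (value >> 31); // zigzag encode
        while ((v & 0xFFFFFF80) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // v2 layout: varint count, then per header:
    // varint key length, UTF-8 key bytes, varint value length (-1 = null), value bytes.
    static byte[] serializeHeaders(String[] keys, byte[][] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(keys.length, out);
        for (int i = 0; i < keys.length; i++) {
            byte[] keyBytes = keys[i].getBytes(StandardCharsets.UTF_8);
            writeVarint(keyBytes.length, out);
            out.write(keyBytes, 0, keyBytes.length);
            if (values[i] == null) {
                writeVarint(-1, out);
            } else {
                writeVarint(values[i].length, out);
                out.write(values[i], 0, values[i].length);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // One header "k1" -> {0x01}: count=1 (zigzag 2), keyLen=2 (zigzag 4),
        // 'k', '1', valueLen=1 (zigzag 2), 0x01 -- six bytes total.
        byte[] wire = serializeHeaders(new String[] {"k1"}, new byte[][] {{0x01}});
        System.out.println(wire.length); // prints 6
    }
}
```

Any change to this layout on the broker side would silently invalidate bytes that Kafka Streams reused verbatim, which is the compatibility concern raised above.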
Another complication is that the processor context can still mutate headers—for
example, by adding new entries—before the record is sent to the producer (see
[ProcessorContextImpl#log()|https://github.com/apache/kafka/blob/6d9ba767c544d600acbcc1eec8dd38d94c739b01/streams/src/main/java/org/apache/kafka/streams/processor/internals/ProcessorContextImpl.java#L136]).
This leaves us with two options:
# Deserialize the headers anyway, which means we lose the optimization; or
# Manually merge: serialize any newly added header entries (e.g., the vector
clock), append them to the existing serialized headers, and then send the
combined header bytes to the producer.
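Option 2 could be sketched as follows: decode only the leading header-count varint, re-encode it with the new total, copy the existing serialized entries verbatim, and append the freshly serialized new entries (e.g., the vector clock). This is a hypothetical illustration rather than existing Kafka Streams code; the varint helpers mirror the zigzag encoding of the v2 record format:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class HeaderMerge {

    static void writeVarint(int value, ByteArrayOutputStream out) {
        int v = (value << 1) ^ (value >> 31); // zigzag encode
        while ((v & 0xFFFFFF80) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Returns {decodedValue, bytesConsumed} for the varint at buf[offset].
    static int[] readVarint(byte[] buf, int offset) {
        int raw = 0, shift = 0, i = offset;
        while (true) {
            int b = buf[i++] & 0xFF;
            raw |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        return new int[] {(raw >>> 1) ^ -(raw & 1), i - offset}; // zigzag decode
    }

    // Serialize one new header entry in the v2 layout:
    // varint key length, UTF-8 key bytes, varint value length, value bytes.
    static byte[] serializeEntry(String key, byte[] value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        writeVarint(keyBytes.length, out);
        out.write(keyBytes, 0, keyBytes.length);
        writeVarint(value.length, out);
        out.write(value, 0, value.length);
        return out.toByteArray();
    }

    // Merge without deserializing the existing entries: only the leading
    // count varint is decoded and rewritten; all entry bytes are copied verbatim.
    static byte[] appendEntries(byte[] existing, byte[][] newEntries) {
        int[] count = readVarint(existing, 0); // {old count, prefix length}
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(count[0] + newEntries.length, out);
        out.write(existing, count[1], existing.length - count[1]);
        for (byte[] e : newEntries) {
            out.write(e, 0, e.length);
        }
        return out.toByteArray();
    }
}
```

The existing entry bytes are never parsed; only the count prefix is decoded and rewritten, which is what would preserve the optimization. Whether that byte-level splice is worth the coupling is exactly the trade-off described below.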
Neither option is entirely straightforward, so we need to weigh the performance
gains of avoiding deserialization against the tighter coupling and added
complexity this would introduce.
was:
When writing to the changelog topic, we currently need to deserialize the
headers, which is not ideal. We should find a way to avoid this for efficiency
reasons.
Suggested solution by [~mjsax]: `We would couple ourselves strictly to the
serialization format of headers inside the message format thought. While we
already use the same format atm we are not tightly coupled atm. So we should
consider the implications if we want to do go down this route. For the case
that this part of the Kafka message format changes, it would impact KS, and
this raised compatibility questions` (ref:
[https://github.com/apache/kafka/pull/21345/changes#r2761685550]
Another challenge is that the processor context might still modify the
headers—by adding new entries—before the record is sent to the producer:
[https://github.com/apache/kafka/blob/6d9ba767c544d600acbcc1eec8dd38d94c739b01/stre[…]che/kafka/streams/processor/internals/ProcessorContextImpl.java|https://github.com/apache/kafka/blob/6d9ba767c544d600acbcc1eec8dd38d94c739b01/streams/src/main/java/org/apache/kafka/streams/processor/internals/ProcessorContextImpl.java#L136]
So either 1) we have to deserialize the headers anyway (no optimization) or 2)
serialize the added things (vector clock in this case) and add them to our
serialized headers and then send them.
> Avoiding headers deserialization while changelogging
> ----------------------------------------------------
>
> Key: KAFKA-20179
> URL: https://issues.apache.org/jira/browse/KAFKA-20179
> Project: Kafka
> Issue Type: Sub-task
> Reporter: Alieh Saeedi
> Priority: Major
>
> When writing to the changelog topic, we currently need to deserialize the
> record headers, which is not ideal from a performance perspective. We should
> look for a way to avoid this overhead.
> One option, as suggested by [~mjsax], would be to reuse the header bytes
> from the Kafka message format directly, without deserialization. However,
> this would couple Kafka Streams tightly to the internal serialization format
> of headers in the Kafka record. Although we already use the same format
> today, we are not yet _formally_ coupled to it. If the Kafka message format
> for headers ever changes, this approach would directly impact Kafka Streams
> and raise compatibility concerns (ref:
> [https://github.com/apache/kafka/pull/21345/changes#r2761685550]).
> Another complication is that the processor context can still mutate
> headers—for example, by adding new entries—before the record is sent to the
> producer (see
> [ProcessorContextImpl#log()|https://github.com/apache/kafka/blob/6d9ba767c544d600acbcc1eec8dd38d94c739b01/streams/src/main/java/org/apache/kafka/streams/processor/internals/ProcessorContextImpl.java#L136]).
> This leaves us with two options:
> # Deserialize the headers anyway, which means we lose the optimization; or
> # Manually merge: serialize any newly added header entries (e.g., the vector
> clock), append them to the existing serialized headers, and then send the
> combined header bytes to the producer.
> Neither option is entirely straightforward, so we need to weigh the
> performance gains of avoiding deserialization against the tighter coupling
> and added complexity this would introduce.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)