[ 
https://issues.apache.org/jira/browse/KAFKA-20179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alieh Saeedi updated KAFKA-20179:
---------------------------------
    Description: 
When writing to the changelog topic, we currently need to deserialize the 
record headers, which is not ideal from a performance perspective. We should 
look for a way to avoid this overhead.

One option, as suggested by [~mjsax], would be to reuse the header bytes from 
the Kafka message format directly, without deserialization. However, this would 
couple Kafka Streams tightly to the internal serialization format of headers in 
the Kafka record. Although we already use the same format today, we are not yet 
_formally_ coupled to it. If the Kafka message format for headers ever changes, 
this approach would directly impact Kafka Streams and raise compatibility 
concerns (ref: 
[https://github.com/apache/kafka/pull/21345/changes#r2761685550]).
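For context, the coupling in question is to the per-header wire layout of the current (v2) record format: each header entry is a zigzag-varint key length, the UTF-8 key bytes, a zigzag-varint value length (-1 for a null value), and the raw value bytes. The following is a standalone illustration of that layout, not Kafka's actual {{ByteUtils}}/{{DefaultRecord}} code:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Standalone sketch of the per-header layout in the v2 record format:
// varint(keyLen), UTF-8 key, varint(valueLen, -1 = null), value bytes.
// Illustrative only; not Kafka's actual implementation.
public class HeaderWireFormat {

    // Zigzag-encode a signed int, then emit it as an unsigned LEB128 varint.
    static void writeVarint(int value, ByteArrayOutputStream out) {
        int v = (value << 1) ^ (value >> 31); // zigzag
        while ((v & 0xFFFFFF80) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Serialize one header entry in the sketched layout.
    static byte[] serializeHeader(String key, byte[] value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        writeVarint(keyBytes.length, out);
        out.write(keyBytes, 0, keyBytes.length);
        if (value == null) {
            writeVarint(-1, out); // null-value sentinel
        } else {
            writeVarint(value.length, out);
            out.write(value, 0, value.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 1-byte key length + 2 key bytes + 1-byte value length + 3 value bytes
        System.out.println(serializeHeader("vc", new byte[] {1, 2, 3}).length); // prints 7
    }
}
```

If this layout ever changed (e.g., a different length encoding), any code that passes raw header bytes through untouched would silently produce corrupt headers, which is exactly the compatibility concern raised in the review comment.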

Another complication is that the processor context can still mutate headers—for 
example, by adding new entries—before the record is sent to the producer (see 
[ProcessorContextImpl#log()|https://github.com/apache/kafka/blob/6d9ba767c544d600acbcc1eec8dd38d94c739b01/streams/src/main/java/org/apache/kafka/streams/processor/internals/ProcessorContextImpl.java#L136]).
 This leaves us with two options:
 # Deserialize the headers anyway, which means we lose the optimization; or
 # Manually merge: serialize any newly added header entries (e.g., the vector 
clock), append them to the existing serialized headers, and then send the 
combined header bytes to the producer.

Neither option is entirely straightforward, so we need to weigh the performance 
gains of avoiding deserialization against the tighter coupling and added 
complexity this would introduce.

  was:
When writing to the changelog topic, we currently need to deserialize the 
headers, which is not ideal. We should find a way to avoid this for efficiency 
reasons.

Suggested solution by [~mjsax]: `We would couple ourselves strictly to the 
serialization format of headers inside the message format thought. While we 
already use the same format atm we are not tightly coupled atm. So we should 
consider the implications if we want to do go down this route. For the case 
that this part of the Kafka message format changes, it would impact KS, and 
this raised compatibility questions` (ref: 
[https://github.com/apache/kafka/pull/21345/changes#r2761685550])

Another challenge is that the processor context might still modify the 
headers—by adding new entries—before the record is sent to the producer: 
[https://github.com/apache/kafka/blob/6d9ba767c544d600acbcc1eec8dd38d94c739b01/streams/src/main/java/org/apache/kafka/streams/processor/internals/ProcessorContextImpl.java#L136]
So either 1) we have to deserialize the headers anyway (no optimization) or 2) 
serialize the added things (vector clock in this case) and add them to our 
serialized headers and then send them.


> Avoiding headers deserialization while changelogging
> ----------------------------------------------------
>
>                 Key: KAFKA-20179
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20179
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Alieh Saeedi
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
