Yeah, I think this is a reasonable thing to do. I've been going back and forth 
on it.

The downside of storing it serialized is that we then need to deserialize it to
emit it. This is a moot point for the (planned) on-disk implementation, but for
the in-memory one it saves some CPU, and possibly some GC pressure, not to
round-trip the record through a byte array.

As is, we serialize it just once instead of serialize + deserialize. Plus, we
currently discard the produced array immediately, so it's easy on the GC,
whereas if we keep it, we have three medium-to-long-term objects: the incoming
record, the serialized array, and the (deserialized) outgoing record. Is this
premature optimization? Possibly.
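
To make the trade-off concrete, here's a minimal sketch of the current
approach, assuming the standard Kafka `Serializer` interface; the class and
field names (`ObjectBuffer`, `memBufferSize`) are made up for illustration and
aren't the actual code in the PR.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.kafka.common.serialization.Serializer;

// Hypothetical sketch: keep the record object itself, and serialize only to
// measure its footprint against the size limit. The byte arrays are discarded
// right away, so the only long-lived object per record is the record itself.
class ObjectBuffer<K, V> {
    private final String topic;
    private final Serializer<K> keySerializer;
    private final Serializer<V> valueSerializer;
    private final Map<K, V> records = new LinkedHashMap<>();
    private long memBufferSize = 0L;

    ObjectBuffer(String topic, Serializer<K> keySerializer, Serializer<V> valueSerializer) {
        this.topic = topic;
        this.keySerializer = keySerializer;
        this.valueSerializer = valueSerializer;
    }

    void put(K key, V value) {
        // Serialize once, purely to estimate size; a real implementation would
        // also subtract the previous entry's size when a key is updated.
        final byte[] keyBytes = keySerializer.serialize(topic, key);
        final byte[] valueBytes = valueSerializer.serialize(topic, value);
        memBufferSize += keyBytes.length + valueBytes.length;
        records.put(key, value); // store the object, not the bytes
    }

    long bufferSize() {
        return memBufferSize;
    }
}
```

The alternative would store the serialized form (e.g. a `Map<K, byte[]>`),
which makes the size accounting exact but forces a deserialize call on every
emit.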

Some other factors to consider: when we send to the changelog, we'll need to
serialize it anyway. But I'm planning to send only on `flush` and to keep the
changelog buffer compact with a LinkedHashMap, so records that get updated or
retracted several times within a commit interval would only get serialized
once. Plus, for this purpose, we still only need the `serialize` side; we could
hang onto the produced array after computing the size just long enough to send
it to the changelogger.
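
On the changelog point, here's roughly what I mean by keeping the buffer
compact with a LinkedHashMap, again assuming the standard Kafka `Serializer`
interface; the `Changelogger` interface and all the names here are
placeholders, not something the PR defines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.kafka.common.serialization.Serializer;

// Hypothetical sketch of the compact changelog buffer: repeated updates or
// retractions of the same key within a commit interval collapse into a single
// entry, and serialization happens once per surviving entry at flush time.
class DirtyBuffer<K, V> {

    // Placeholder for whatever actually writes to the changelog topic;
    // a null value stands for a retraction (tombstone).
    interface Changelogger {
        void send(byte[] key, byte[] value);
    }

    private final String topic;
    private final Serializer<K> keySerializer;
    private final Serializer<V> valueSerializer;
    private final Map<K, V> dirty = new LinkedHashMap<>();

    DirtyBuffer(String topic, Serializer<K> keySerializer, Serializer<V> valueSerializer) {
        this.topic = topic;
        this.keySerializer = keySerializer;
        this.valueSerializer = valueSerializer;
    }

    void update(K key, V value) {
        dirty.put(key, value);  // a later update overwrites the earlier one
    }

    void retract(K key) {
        dirty.put(key, null);   // tombstone; also overwrites earlier updates
    }

    void flush(Changelogger changelogger) {
        for (final Map.Entry<K, V> entry : dirty.entrySet()) {
            final byte[] keyBytes = keySerializer.serialize(topic, entry.getKey());
            final byte[] valueBytes = entry.getValue() == null
                ? null
                : valueSerializer.serialize(topic, entry.getValue());
            changelogger.send(keyBytes, valueBytes);  // serialize side only
        }
        dirty.clear();
    }
}
```

Only the last state per key survives until `flush`, so a record that's updated
or retracted ten times within a commit interval still gets serialized just
once, and the deserializer never gets involved.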

For changelogging purposes, we'd only need to deserialize when we recover on
startup, not in steady-state operation, so I think it's still more economical
to store the records as objects rather than serialized.
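
And just to show where the one unavoidable deserialize would live, the recovery
side would look something like this (assuming the standard `Deserializer`
interface; the names are again illustrative):

```java
import java.util.Map;

import org.apache.kafka.common.serialization.Deserializer;

// Hypothetical restore path: deserialization only happens here, while
// replaying the changelog into the in-memory buffer on startup.
class BufferRestorer<K, V> {
    private final String topic;
    private final Deserializer<K> keyDeserializer;
    private final Deserializer<V> valueDeserializer;

    BufferRestorer(String topic, Deserializer<K> keyDeserializer, Deserializer<V> valueDeserializer) {
        this.topic = topic;
        this.keyDeserializer = keyDeserializer;
        this.valueDeserializer = valueDeserializer;
    }

    void restore(byte[] keyBytes, byte[] valueBytes, Map<K, V> buffer) {
        final K key = keyDeserializer.deserialize(topic, keyBytes);
        if (valueBytes == null) {
            buffer.remove(key); // tombstone written for a retraction
        } else {
            buffer.put(key, valueDeserializer.deserialize(topic, valueBytes));
        }
    }
}
```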

It is true that there's really no tight correlation between the heap used by an 
object and the heap used by its serialized form. So at the moment, we're only 
roughly obeying the size limit. For primitive data, it's probably pretty close, 
though.

I'm open to either way of doing it, but that was my thinking. What say you?

[ Full content available at: https://github.com/apache/kafka/pull/5693 ]