Yeah, I think this is a reasonable thing to do. I've been going back and forth on it.
The downside of storing it serialized is that we then need to deserialize it to emit it. This is a moot point for the (planned) on-disk implementation, but for the in-memory one it saves some CPU and possibly some GC pressure not to round-trip it through a byte array. As is, we serialize just once instead of serialize + deserialize. Plus, we currently discard the produced array immediately, so it's easy on the GC, whereas if we kept it, we'd have three medium-to-long-term objects: the incoming record, the serialized array, and the (deserialized) outgoing record. Is this premature optimization? Possibly.

Some other factors to consider: when we send to the changelog, we'll need to serialize anyway. But I'm planning to send only on `flush` and to keep the changelog buffer compact with a LinkedHashMap (rough sketch below), so records that get updated or retracted several times within a commit interval would only be serialized once. Plus, for this purpose, we still only need the `serialize` side; we could hang onto the produced array after computing the size just long enough to send it to the changelogger. For changelogging purposes, we'd only need to deserialize when we recover on startup, not in steady-state operation, so I think it's still more economical to store the records as objects rather than serialized.

It is true that there's no tight correlation between the heap used by an object and the heap used by its serialized form, so at the moment we're only roughly obeying the size limit (second sketch below). For primitive data, it's probably pretty close, though.

I'm open to either way of doing it, but that was my thinking. What say you?
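
To make the LinkedHashMap idea a bit more concrete, here's a rough sketch of the kind of coalescing I have in mind. The class and method names are invented for illustration; this is not the code in the PR:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical sketch: coalesce changelog updates per key so each key is
// serialized at most once per commit interval, when we flush.
class ChangelogBuffer<K, V> {
    // LinkedHashMap: repeated puts to the same key overwrite in place,
    // so the buffer stays compact, and iteration order is deterministic.
    private final Map<K, V> dirty = new LinkedHashMap<>();

    void put(K key, V value) {
        dirty.put(key, value);   // a later update simply replaces the earlier one
    }

    void retract(K key) {
        dirty.put(key, null);    // null stands in for a tombstone
    }

    // Called on flush/commit: serialize each surviving entry exactly once and
    // hand the bytes to whatever sends them to the changelog topic.
    void flush(Function<K, byte[]> keySerializer,
               Function<V, byte[]> valueSerializer,
               BiConsumer<byte[], byte[]> changelogSend) {
        for (Map.Entry<K, V> entry : dirty.entrySet()) {
            byte[] key = keySerializer.apply(entry.getKey());
            byte[] value = entry.getValue() == null
                ? null
                : valueSerializer.apply(entry.getValue());
            changelogSend.accept(key, value);
        }
        dirty.clear();
    }
}
```

The point is just that an update followed by several more updates or a retraction within the same commit interval costs one map slot and, at most, one serialization on flush.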

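And for the size accounting, this is roughly what I mean by serializing once just to learn the size while keeping the object form around (again, invented names, just a sketch):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: account for buffered records against a byte limit using
// their serialized size, but store the deserialized object so emitting it
// later doesn't require another deserialization.
class SizeTrackingBuffer<K, V> {
    private final Map<K, V> records = new LinkedHashMap<>();
    private final Map<K, Integer> sizes = new HashMap<>();
    private final Function<V, byte[]> valueSerializer;
    private final long maxBytes;
    private long bytesUsed = 0L;

    SizeTrackingBuffer(long maxBytes, Function<V, byte[]> valueSerializer) {
        this.maxBytes = maxBytes;
        this.valueSerializer = valueSerializer;
    }

    void put(K key, V value) {
        // Serialize once, only to measure; the byte[] is discarded right away,
        // so the only long-lived objects are the record itself and the map entries.
        int newSize = valueSerializer.apply(value).length;
        Integer oldSize = sizes.put(key, newSize);
        bytesUsed += newSize - (oldSize == null ? 0 : oldSize);
        records.put(key, value);

        // The serialized size only approximates the object's heap footprint,
        // so the limit is enforced roughly, not exactly.
        if (bytesUsed > maxBytes) {
            throw new IllegalStateException(
                "buffer over limit: " + bytesUsed + " > " + maxBytes);
        }
    }

    V get(K key) {
        return records.get(key);   // no deserialization on the read/emit path
    }
}
```

Whether to throw or evict on overflow is a separate question; the sketch just shows the accounting.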