[ https://issues.apache.org/jira/browse/KAFKA-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163134#comment-17163134 ]
Guozhang Wang commented on KAFKA-8037:
--------------------------------------

Great comments from all of you, thank you so much! I'd like to see if I can grasp the core of this conversation so far: it seems we all agree that Streams has to ask for users' input rather than making "some decision" itself without incurring edge cases, but we have not yet agreed on what that "decision" should be. So far I've seen two options here:

1) We ask users, per source table, whether we can optimize away the changelog topic and just reuse the source topic; if "yes", we treat it like a normal changelog topic during restoration, i.e. no serde, no exception handler, etc.

2) Per source table, we always optimize away the changelog and piggy-back on the source topic, but we ask users whether the restoration process needs to be specially handled just like normal processing, i.e. "deserialize, exception handling, serialize", or not.

Note that I've excluded global state stores from this discussion, since for that case I now think we should always piggy-back on the source topic and always treat restoration just like normal processing.

Aside from the user's perspective, the difference between 1) and 2) raises a performance / footprint question: is keeping an extra duplicated topic (on which you never do serde) worse, or is longer restoration latency worse? There are of course other considerations, e.g. with a separate topic we would not rely on the source topic's retention, but I'd like to ask this trade-off question first because I cannot say I have a clear winner in my mind :)

For now I'm leaning a bit towards having an extra topic for better restoration latency since, in the long term, storing more topics in Kafka may not be a huge deal (though I have to admit that "more topics" is still a top argument against Kafka Streams today, as [~ableegoldman] said). I very much welcome any counter-arguments here.
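For context, the source-topic reuse being debated here is today gated behind the existing topology optimization flag. A minimal, non-runnable configuration sketch (the `StreamsConfig` constants and `builder.table()` call are from the public Kafka Streams API; the application id, bootstrap server, and topic name are placeholders I made up):

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;

public class SourceTopicReuseSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ktable-restore-demo"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        // Enables, among other rewrites, reusing the source topic as the
        // table's changelog -- the per-source-table optimization discussed above.
        props.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);

        StreamsBuilder builder = new StreamsBuilder();
        // With the optimization on, no extra changelog topic is created for this
        // table; restoration reads the source topic's plain bytes directly.
        builder.table("input-topic", Consumed.with(Serdes.String(), Serdes.String()));
    }
}
```

Option 1) would effectively turn this global flag into a per-table choice, while option 2) would instead ask how restoration from the reused topic should behave.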
> KTable restore may load bad data
> --------------------------------
>
>                 Key: KAFKA-8037
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8037
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Priority: Minor
>              Labels: pull-request-available
>
> If an input topic contains bad data, users can specify a
> `deserialization.exception.handler` to drop corrupted records on read.
> However, this mechanism may be bypassed on restore. Assume a
> `builder.table()` call reads and drops a corrupted record. If the table state
> is lost and restored from the changelog topic, the corrupted record may be
> copied into the store, because on restore plain bytes are copied.
>
> If the KTable is used in a join, an internal `store.get()` call to look up the
> record would fail with a deserialization exception if the value part cannot
> be deserialized.
>
> GlobalKTables are affected, too (cf. KAFKA-7663, which may allow a fix for the
> GlobalKTable case). It's unclear to me atm how this issue could be addressed
> for KTables, though.
>
> Note that user state stores are not affected, because they always have a
> dedicated changelog topic (and don't reuse an input topic), and thus the
> corrupted record would not be written into the changelog.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
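To make the failure mode in the issue description concrete, here is a toy simulation in plain Java with no Kafka dependencies: the "topic", the serde, and all names are invented for illustration. The normal read path deserializes and drops corrupted records (as a log-and-continue deserialization exception handler would), while the restore path copies raw bytes, so a later lookup on the restored store hits the corrupted value.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy model of the bug, NOT Kafka code: byte[] values stand in for topic records.
public class RestoreBypass {

    // Normal processing path: deserialize each record and drop corrupted ones,
    // mimicking a deserialization.exception.handler that logs and continues.
    static Map<String, String> normalRead(Map<String, byte[]> topic) {
        Map<String, String> store = new HashMap<>();
        for (Map.Entry<String, byte[]> e : topic.entrySet()) {
            try {
                store.put(e.getKey(), deserialize(e.getValue()));
            } catch (IllegalArgumentException corrupted) {
                // corrupted record silently dropped on read
            }
        }
        return store;
    }

    // Restore path: plain bytes are copied, no deserialization happens,
    // so corrupted records sneak into the restored store.
    static Map<String, byte[]> restore(Map<String, byte[]> topic) {
        return new HashMap<>(topic);
    }

    // Invented "serde": a value is only valid if it starts with "v:".
    static String deserialize(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        if (!s.startsWith("v:")) {
            throw new IllegalArgumentException("corrupt record: " + s);
        }
        return s.substring(2);
    }

    public static void main(String[] args) {
        Map<String, byte[]> topic = new HashMap<>();
        topic.put("a", "v:good".getBytes(StandardCharsets.UTF_8));
        topic.put("b", "garbage".getBytes(StandardCharsets.UTF_8));

        // Normal processing keeps only the valid record "a".
        System.out.println("normal store size: " + normalRead(topic).size());

        // Restore copies both records; a join-time get("b") then fails.
        Map<String, byte[]> restored = restore(topic);
        System.out.println("restored store size: " + restored.size());
        try {
            deserialize(restored.get("b"));
            System.out.println("get(b) ok");
        } catch (IllegalArgumentException ex) {
            System.out.println("get(b) failed to deserialize");
        }
    }
}
```

This is exactly why option 2) above asks whether restoration should run the full "deserialize, exception handling, serialize" path: doing so would filter "b" out during restore, at the cost of restoration latency.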