[jira] [Commented] (KAFKA-8037) KTable restore may load bad data

John Roesler (Jira) Wed, 22 Jul 2020 12:07:14 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163020#comment-17163020
 ]


John Roesler commented on KAFKA-8037:
-------------------------------------

Thanks [~ableegoldman] for the thoughts.

It seems like we're looking at the same situation and seeing different things.

I agree that experience shows that understanding the full implications of this 
optimization is nontrivial. But it seems like this is a strong indication that 
we _should_ require an explicit parameter and not try to do something subtle 
and complicated that has a huge impact on user experience. Especially if 
there's a good chance we can't even do it right. And double especially if the 
path to disable it, in the case it does more harm than good, is weird and 
mysterious like, "you have to pass two different instances of the same serde 
supplier instead of the same instance of the serde supplier twice in order to 
get a changelog topic for your source table".

It seems like, since we implemented this optimization, we've discovered like 
half-a-dozen ways in which a source topic is not actually the same as a 
changelog topic. A situation like that begs for an explicit opt-in, rather than 
making the choice automatically, based on whether you wrote the app in one of 
several seemingly equivalent ways.

> KTable restore may load bad data
> --------------------------------
>
>                 Key: KAFKA-8037
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8037
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Priority: Minor
>              Labels: pull-request-available
>
> If an input topic contains bad data, users can specify a 
> `deserialization.exception.handler` to drop corrupted records on read. 
> However, this mechanism may be bypassed on restore. Assume a 
> `builder.table()` call reads and drops a corrupted record. If the table state 
> is lost and restored from the changelog topic, the corrupted record may be 
> copied into the store, because on restore plain bytes are copied.
> If the KTable is used in a join, an internal `store.get()` call to look up the 
> record would fail with a deserialization exception if the value part cannot 
> be deserialized.
> GlobalKTables are affected, too (cf. KAFKA-7663, which may allow a fix for 
> the GlobalKTable case). It's unclear to me at the moment how this issue could 
> be addressed for KTables, though.
> Note that user state stores are not affected, because they always have a 
> dedicated changelog topic (and don't reuse an input topic) and thus the 
> corrupted record would not be written into the changelog.
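
To make the failure mode in the description concrete, here is a minimal sketch in plain Java. It does not use the Kafka Streams API: the class `RestoreBypassSketch`, the `RawRecord` type, the integer-as-text value format, and the in-memory `Map` standing in for the state store are all hypothetical, chosen only to show why a read path that validates and a restore path that copies plain bytes can end up with different store contents.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation, not Kafka Streams code.
public class RestoreBypassSketch {

    /** One raw key/value pair as it sits in the simulated input topic. */
    public static final class RawRecord {
        final String key;
        final byte[] value;

        RawRecord(String key, String value) {
            this.key = key;
            this.value = value.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Assumed value format: a UTF-8 encoded decimal integer. */
    public static int deserialize(byte[] bytes) {
        return Integer.parseInt(new String(bytes, StandardCharsets.UTF_8));
    }

    /** Normal processing: deserialize on read and drop records that fail. */
    public static Map<String, byte[]> readWithHandler(List<RawRecord> topic) {
        Map<String, byte[]> store = new HashMap<>();
        for (RawRecord r : topic) {
            try {
                deserialize(r.value);        // validates the bytes
                store.put(r.key, r.value);   // only valid records reach the store
            } catch (NumberFormatException e) {
                // Stands in for a "drop corrupted records" handler: skip it.
            }
        }
        return store;
    }

    /** Restore: plain bytes are copied, so nothing is validated. */
    public static Map<String, byte[]> restore(List<RawRecord> topic) {
        Map<String, byte[]> store = new HashMap<>();
        for (RawRecord r : topic) {
            store.put(r.key, r.value);
        }
        return store;
    }

    /** Stands in for the store.get() a join performs; it must deserialize. */
    public static Integer lookup(Map<String, byte[]> store, String key) {
        byte[] bytes = store.get(key);
        return bytes == null ? null : deserialize(bytes);
    }

    /** A topic with one good record and one corrupted record. */
    public static List<RawRecord> sampleTopic() {
        List<RawRecord> topic = new ArrayList<>();
        topic.add(new RawRecord("a", "1"));
        topic.add(new RawRecord("b", "not-a-number")); // corrupted value
        return topic;
    }

    public static void main(String[] args) {
        List<RawRecord> topic = sampleTopic();

        // Live path: the corrupted record was dropped, so the lookup finds nothing.
        System.out.println("live lookup: " + lookup(readWithHandler(topic), "b"));

        // Restored path: the corrupted bytes are in the store and the lookup throws.
        try {
            lookup(restore(topic), "b");
        } catch (NumberFormatException e) {
            System.out.println("lookup failed after restore: " + e.getMessage());
        }
    }
}
```

In this sketch the store built by `readWithHandler` never contains key "b", while the store built by `restore` does, so the same `lookup` call that was harmless before the restore throws afterwards. That mirrors the case where the input topic itself is reused for restoration; with a dedicated changelog topic, the dropped record would never have been written to it.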



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
