[ 
https://issues.apache.org/jira/browse/KAFKA-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163085#comment-17163085
 ] 

Sophie Blee-Goldman commented on KAFKA-8037:
--------------------------------------------

> many users might not even know if their serde is symmetric

If users don't know whether their serde is symmetric, how can we possibly expect 
them to know whether or not they can turn on this optimization? If the 
asymmetric/side-effects problem is inherent in some serdes, then sure, we can't 
avoid it. But if users don't fully understand all the subtleties of this 
optimization, we shouldn't expect them to make an educated decision on whether 
to turn it on. They might turn it on just because.

On that note, I assume the general consensus is that it should be off by 
default. I agree that, on the face of it, that seems like the only reasonable 
choice: we should always be correct by default, with the optimization as an 
opt-in.

That said... in my (admittedly anecdotal) experience, the creation of extra 
topics and the extra load on the brokers, etc., is a major pain point for users 
of Streams. I'm pretty sure I've seen it quoted in a "why we decided against 
Kafka Streams" type article. Compare this with the problem of asymmetric serdes, 
for which we have received exactly zero complaints as far as I am aware.

I'm also still not convinced that this is a problem, but that is most likely 
just my ignorance of what these serdes are actually doing. Can you give a 
specific example of how things would break due to the asymmetric JSON/Avro 
serdes, and/or the schema registry side effects?

For example, if the side effect is just "register the schema", then it seems 
like we wouldn't have a problem (since the record would have been serialized at 
least once before). But I get the sense I'm missing some critical details in my 
understanding here :) 

> KTable restore may load bad data
> --------------------------------
>
>                 Key: KAFKA-8037
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8037
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Priority: Minor
>              Labels: pull-request-available
>
> If an input topic contains bad data, users can specify a 
> `deserialization.exception.handler` to drop corrupted records on read. 
> However, this mechanism may be by-passed on restore. Assume a 
> `builder.table()` call reads and drops a corrupted record. If the table state 
> is lost and restored from the changelog topic, the corrupted record may be 
> copied into the store, because on restore plain bytes are copied.
> If the KTable is used in a join, an internal `store.get()` call to lookup the 
> record would fail with a deserialization exception if the value part cannot 
> be deserialized.
> GlobalKTables are affected, too (cf. KAFKA-7663 that may allow a fix for 
> GlobalKTable case). It's unclear to me atm, how this issue could be addressed 
> for KTables though.
> Note, that user state stores are not affected, because they always have a 
> dedicated changelog topic (and don't reuse an input topic) and thus the 
> corrupted record would not be written into the changelog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
