[ https://issues.apache.org/jira/browse/KAFKA-8037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163133#comment-17163133 ]

Almog Gavra commented on KAFKA-8037:
------------------------------------

> Can you give a specific example of how things would break due to the 
> asymmetric JSON/AVRO serdes, and/or the schema registry side effects?

This depends on what you mean by "how things would break". Is it a matter of 
correctness that the state stores are identical before and after recovery? If I 
have a serde that only serializes half the fields (something ksqlDB supports 
today, and a feature that is useful for people who don't control the input 
data), then after I recover, my local disk usage might be twice as much and 
lookups into the state store cost more to deserialize. Nothing "breaks" in the 
sense that you'll still be able to deserialize all of that data and continue 
processing, but the efficiency of your system changes.
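To make the asymmetry concrete, here is a minimal simulation (plain Python, not the ksqlDB or Kafka Streams API; the field names and `project_serialize` helper are my own for illustration): a "projection" serde writes only a subset of each record's fields into the state store, but a restore that copies raw source-topic bytes puts the full record back.

```python
# Hypothetical sketch: a serde that serializes only some fields, vs. a
# byte-for-byte restore from the reused source topic.
import json

PROJECTED_FIELDS = ["id", "name"]  # the subset of fields the serde keeps

def project_serialize(record: dict) -> bytes:
    """What normal processing writes into the state store."""
    return json.dumps({k: record[k] for k in PROJECTED_FIELDS}).encode()

def source_topic_bytes(record: dict) -> bytes:
    """What the upstream producer wrote: the full record."""
    return json.dumps(record).encode()

record = {"id": 1, "name": "a", "blob": "x" * 100, "extra": "y" * 100}

normal_store_value = project_serialize(record)     # small projection
restored_store_value = source_topic_bytes(record)  # raw bytes copied on restore

# After restore, the store holds the full record, not the projection,
# so disk usage grows and every lookup deserializes more data.
print(len(normal_store_value), len(restored_store_value))
```

Nothing here fails outright, which is the point: both values deserialize fine, but the restored store is strictly larger than the one normal processing would have built.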

Things get a little worse from a side-effect perspective, specifically with 
Confluent Schema Registry serdes. If we're reusing the source topic, then 
naturally we should pass the source topic as the name to the serde (or else 
face KAFKA-10179). If I write to the state store with a serializer that is not 
identical to the one that wrote to the topic, I will register schemas under 
the source topic's subject, which might break other producers (something that 
ksqlDB users complain about: 
https://github.com/confluentinc/ksql/issues/5553).

FWIW, the way we're working around this in ksqlDB is to always serialize using 
the "phantom" changelog subject and deserialize using both subjects.

> KTable restore may load bad data
> --------------------------------
>
>                 Key: KAFKA-8037
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8037
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Priority: Minor
>              Labels: pull-request-available
>
> If an input topic contains bad data, users can specify a 
> `deserialization.exception.handler` to drop corrupted records on read. 
> However, this mechanism may be bypassed on restore. Assume a 
> `builder.table()` call reads and drops a corrupted record. If the table state 
> is lost and restored from the changelog topic, the corrupted record may be 
> copied into the store, because on restore plain bytes are copied.
> If the KTable is used in a join, an internal `store.get()` call to lookup the 
> record would fail with a deserialization exception if the value part cannot 
> be deserialized.
> GlobalKTables are affected, too (cf. KAFKA-7663, which may allow a fix for 
> the GlobalKTable case). It's unclear to me atm how this issue could be 
> addressed for KTables, though.
> Note, that user state stores are not affected, because they always have a 
> dedicated changelog topic (and don't reuse an input topic) and thus the 
> corrupted record would not be written into the changelog.
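The restore-bypass described in the issue can be simulated in a few lines. This is a minimal sketch, not the Kafka Streams API; the record layout and helper names are my own. Normal reads deserialize and let a handler drop corrupt records (LogAndContinue-style), while restore copies the raw bytes verbatim, so the corrupt record reappears and only fails later, e.g. on a join's `store.get()`.

```python
import json

# Source topic reused as changelog, containing one corrupt record.
topic = [b'{"k": 1}', b"not-json", b'{"k": 2}']

def consume_with_handler(records):
    """Normal processing: deserialize, drop records that fail."""
    store = {}
    for i, raw in enumerate(records):
        try:
            store[i] = json.loads(raw)
        except json.JSONDecodeError:
            continue  # the deserialization handler drops the corrupt record
    return store

def restore_from_changelog(records):
    """Restore path: plain bytes are copied, nothing is deserialized."""
    return {i: raw for i, raw in enumerate(records)}

live_store = consume_with_handler(topic)        # corrupt record was skipped
restored_store = restore_from_changelog(topic)  # corrupt bytes are back

# A later lookup that must deserialize the value now fails, which is
# exactly the exception the handler was configured to prevent:
try:
    json.loads(restored_store[1])
except json.JSONDecodeError:
    pass
```

The divergence between `live_store` (no entry for the corrupt record) and `restored_store` (raw corrupt bytes present) is the bug in miniature.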



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
