[ https://issues.apache.org/jira/browse/KAFKA-19430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013984#comment-18013984 ]
Lianet Magrans commented on KAFKA-19430:
----------------------------------------

{quote}Why does it extend RetriableException if it's not considered retriable?{quote}
I imagine it could be because it could be retriable on the produce path (a bit flip that makes the CRC check fail before the record is written to the log), but I haven't fully checked the broker write-path code.

{quote}what the root cause of such an error actually is? Would it be some "bit flip" that should actually not happen again if we retry? Or some other env related issue? Or is it some permanent corrupted that won't go away (wondering where such a permanent corruption would originate from)?{quote}
Both, I would imagine (transient or permanent), but on different paths (read/write). Note that, from what I see, this exception is originally generated only on the broker side, so:
* get it on the consume path: the broker identified the data as corrupted when reading from the log -> I would expect this is not retriable (ex. disk corruption)
* get it on the produce path: the broker identified the data as corrupted when attempting to write it -> I would expect retrying makes sense here (ex. bit flip)

{quote}So the position does not advance at all, and if we would call `poll()` again we just fetch the same batch (and potentially fail again with the same error)?{quote}
Exactly. Positions only advance on a successful fetch (or when advanced manually, ex. seek). A failed fetch with CorruptRecordException does not move the position. That's why it's effectively an end state for the poll loop: it breaks the loop, and the consumer cannot make progress and cannot decide on its own how/what to skip.

{quote}Just passing in `CorruptRecordException` as nested error into the thrown `KafkaException` could give enough signal for KS, and should not require a KIP?
{quote}
Agree, good point, no KIP. But regarding "could give enough signal for KS": if it's on the produce path, then I expect having that signal could lead somewhere (retry and fix). On the consume path, I expect it would lead somewhere only if the broker can fetch the data from the same position again (not sure whether this case of transient corruption on a broker log read can happen), or if on the KS app we advance the positions to skip the problematic records. This last option is where it's not clear to me that we would have the info to know what to skip (where to seek to).

Very interesting!

> Don't fail on RecordCorruptedException
> --------------------------------------
>
>                 Key: KAFKA-19430
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19430
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Assignee: Uladzislau Blok
>            Priority: Major
>
> From [https://github.com/confluentinc/kafka-streams-examples/issues/524]
> Currently, the existing `DeserializationExceptionHandler` is applied when
> de-serializing the record key/value byte[] inside Kafka Streams. This implies
> that a `RecordCorruptedException` is not handled.
> We should explore to not let Kafka Streams crash, but maybe retry this error
> automatically (as `RecordCorruptedException extends RetriableException`), and
> find a way to pump the error into the existing exception handler.
> If the error is transient, user can still use `REPLACE_THREAD` in the
> uncaught exception handler, but this is a rather heavy weight approach.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
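The "nested error as signal for KS" idea discussed above amounts to walking the cause chain of the thrown exception. A minimal sketch, with the caveat that the `CorruptRecordException` here is a local stand-in for `org.apache.kafka.common.errors.CorruptRecordException` (not on the classpath in this snippet), and the helper name is invented for illustration:

```java
// Sketch: detect a nested corrupt-record error by inspecting the cause chain
// of a wrapping exception (e.g. the KafkaException that Kafka Streams sees).
public class CauseChainCheck {

    // Local stand-in for org.apache.kafka.common.errors.CorruptRecordException.
    static class CorruptRecordException extends RuntimeException {
        CorruptRecordException(String msg) { super(msg); }
    }

    // Returns true if any throwable in the cause chain is a corrupt-record error.
    static boolean hasCorruptRecordCause(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof CorruptRecordException) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RuntimeException wrapped = new RuntimeException("error while fetching",
                new CorruptRecordException("invalid CRC"));
        System.out.println(hasCorruptRecordCause(wrapped));                   // true
        System.out.println(hasCorruptRecordCause(new RuntimeException("x"))); // false
    }
}
```

With such a check in an uncaught exception handler, an application could at least distinguish this error class from other fatal failures, even before deciding whether to retry or skip.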
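The "advance the position to skip the problematic record" idea from the thread can be sketched without a broker. The `FakeConsumer` below is an invented stand-in (the real calls would be `KafkaConsumer.poll`/`position`/`seek`), with `null` entries playing the role of records that fail the CRC check; it only illustrates why, without an explicit seek, the same batch is re-fetched and fails again:

```java
import java.util.*;

// Sketch of the "skip past a corrupted record" idea. FakeConsumer is an
// invented stand-in for KafkaConsumer; null log entries simulate records
// the broker reports as corrupted.
public class SkipCorrupted {

    // Local stand-in for org.apache.kafka.common.errors.CorruptRecordException.
    static class CorruptRecordException extends RuntimeException {
        CorruptRecordException(long offset) { super("corrupt record at offset " + offset); }
    }

    static class FakeConsumer {
        private final List<String> log;   // null = corrupted record
        private long position = 0;

        FakeConsumer(List<String> log) { this.log = log; }

        long position() { return position; }
        void seek(long offset) { position = offset; }

        // Like poll(): either returns the next record and advances the
        // position, or throws WITHOUT advancing it -- the consumer is stuck.
        String poll() {
            if (position >= log.size()) return null;   // nothing left
            String rec = log.get((int) position);
            if (rec == null) throw new CorruptRecordException(position);
            position++;
            return rec;
        }
    }

    // Drain the log, skipping corrupted records by seeking past them.
    static List<String> consumeSkippingCorrupted(FakeConsumer consumer) {
        List<String> consumed = new ArrayList<>();
        while (true) {
            try {
                String rec = consumer.poll();
                if (rec == null) return consumed;
                consumed.add(rec);
            } catch (CorruptRecordException e) {
                // Without this seek, the next poll() re-fetches the same
                // position and fails again with the same error.
                consumer.seek(consumer.position() + 1);
            }
        }
    }

    public static void main(String[] args) {
        FakeConsumer c = new FakeConsumer(Arrays.asList("a", "b", null, "c"));
        System.out.println(consumeSkippingCorrupted(c)); // prints [a, b, c]
    }
}
```

As the comment notes, the hard part in the real system is that the exception alone may not tell the application which position to seek to; this sketch assumes the failing offset is known, which is exactly the information in question.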