[ https://issues.apache.org/jira/browse/KAFKA-19430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18013984#comment-18013984 ]
Lianet Magrans commented on KAFKA-19430:
----------------------------------------

{quote}Why does it extend RetriableException if it's not considered retriable?{quote}
I imagine it could be because it could be retriable on the produce path (a bit flip that makes the CRC check fail before the record is written to the log), but I haven't fully checked the broker write-path code.

{quote}what the root cause of such an error actually is? Would it be some "bit flip" that should actually not happen again if we retry? Or some other env related issue? Or is it some permanent corrupted that won't go away (wondering where such a permanent corruption would originate from)?{quote}
Both, I would imagine (transient or permanent), but on different paths (read/write). Note that, from what I see, this exception is originally generated only on the broker side, so:
* get it on the consume path: the broker identified the data as corrupted when reading from the log -> I would expect this is not retriable (ex. disk corruption)
* get it on the produce path: the broker identified the data as corrupted when attempting to write it -> I would expect retrying makes sense here (ex. bit flip)

{quote}So the position does not advance at all, and if we would call `poll()` again we just fetch the same batch (and potentially fail again with the same error)?{quote}
Exactly. Positions only advance on a successful fetch (or when advanced manually, ex. seek). A failed fetch with CorruptRecordException does not move the position. That's why it's effectively an end state for the poll loop: it breaks the loop, and the consumer cannot make progress and cannot decide on its own how/what to skip.

{quote}Just passing in `CorruptRecordException` as nested error into the thrown `KafkaException` could give enough signal for KS, and should not require a KIP?
{quote}
Agree, good point, no KIP. But regarding "could give enough signal for KS": if it's on the produce path, then I expect having that signal could lead somewhere (retry and fix). On the consume path, I expect it would lead somewhere only if the broker can fetch the data from the same position again (not sure whether this case of transient corruption on a broker log read can happen), or if on the KS app we advance the positions to skip the problematic records. This last option is where it's not clear to me that we would have the info to know what to skip (where to seek to).

Very interesting!

> Don't fail on RecordCorruptedException
> --------------------------------------
>
>                 Key: KAFKA-19430
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19430
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Assignee: Uladzislau Blok
>            Priority: Major
>
> From [https://github.com/confluentinc/kafka-streams-examples/issues/524]
> Currently, the existing `DeserializationExceptionHandler` is applied when
> de-serializing the record key/value byte[] inside Kafka Streams. This implies
> that a `RecordCorruptedException` is not handled.
> We should explore to not let Kafka Streams crash, but maybe retry this error
> automatically (as `RecordCorruptedException extends RetriableException`), and
> find a way to pump the error into the existing exception handler.
> If the error is transient, user can still use `REPLACE_THREAD` in the
> uncaught exception handler, but this is a rather heavy weight approach.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
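The "nested error as signal for KS" idea discussed above amounts to walking the cause chain of the thrown exception. A minimal sketch, with the caveat that the `CorruptRecordException` here is a local stand-in for `org.apache.kafka.common.errors.CorruptRecordException` (not on the classpath in this snippet), and the helper name is invented for illustration:

```java
// Sketch: detect a nested corrupt-record error by inspecting the cause chain
// of a wrapping exception (e.g. the KafkaException that Kafka Streams sees).
public class CauseChainCheck {

    // Local stand-in for org.apache.kafka.common.errors.CorruptRecordException.
    static class CorruptRecordException extends RuntimeException {
        CorruptRecordException(String msg) { super(msg); }
    }

    // Returns true if any throwable in the cause chain is a corrupt-record error.
    static boolean hasCorruptRecordCause(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof CorruptRecordException) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RuntimeException wrapped = new RuntimeException("error while fetching",
                new CorruptRecordException("invalid CRC"));
        System.out.println(hasCorruptRecordCause(wrapped));                   // true
        System.out.println(hasCorruptRecordCause(new RuntimeException("x"))); // false
    }
}
```

With such a check in an uncaught exception handler, an application could at least distinguish this error class from other fatal failures, even before deciding whether to retry or skip.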
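The "advance the position to skip the problematic record" idea from the thread can be sketched without a broker. The `FakeConsumer` below is an invented stand-in (the real calls would be `KafkaConsumer.poll`/`position`/`seek`), with `null` entries playing the role of records that fail the CRC check; it only illustrates why, without an explicit seek, the same batch is re-fetched and fails again:

```java
import java.util.*;

// Sketch of the "skip past a corrupted record" idea. FakeConsumer is an
// invented stand-in for KafkaConsumer; null log entries simulate records
// the broker reports as corrupted.
public class SkipCorrupted {

    // Local stand-in for org.apache.kafka.common.errors.CorruptRecordException.
    static class CorruptRecordException extends RuntimeException {
        CorruptRecordException(long offset) { super("corrupt record at offset " + offset); }
    }

    static class FakeConsumer {
        private final List<String> log;   // null = corrupted record
        private long position = 0;

        FakeConsumer(List<String> log) { this.log = log; }

        long position() { return position; }
        void seek(long offset) { position = offset; }

        // Like poll(): either returns the next record and advances the
        // position, or throws WITHOUT advancing it -- the consumer is stuck.
        String poll() {
            if (position >= log.size()) return null;   // nothing left
            String rec = log.get((int) position);
            if (rec == null) throw new CorruptRecordException(position);
            position++;
            return rec;
        }
    }

    // Drain the log, skipping corrupted records by seeking past them.
    static List<String> consumeSkippingCorrupted(FakeConsumer consumer) {
        List<String> consumed = new ArrayList<>();
        while (true) {
            try {
                String rec = consumer.poll();
                if (rec == null) return consumed;
                consumed.add(rec);
            } catch (CorruptRecordException e) {
                // Without this seek, the next poll() re-fetches the same
                // position and fails again with the same error.
                consumer.seek(consumer.position() + 1);
            }
        }
    }

    public static void main(String[] args) {
        FakeConsumer c = new FakeConsumer(Arrays.asList("a", "b", null, "c"));
        System.out.println(consumeSkippingCorrupted(c)); // prints [a, b, c]
    }
}
```

As the comment notes, the hard part in the real system is that the exception alone may not tell the application which position to seek to; this sketch assumes the failing offset is known, which is exactly the information in question.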