mjsax commented on PR #20254: URL: https://github.com/apache/kafka/pull/20254#issuecomment-3134139853
Thanks for the PR.

I checked it out and ran it locally, and I believe it's actually a bug inside `KafkaProducer`. Before Kafka Streams commits offsets, it calls `producer.flush()`, but this call does not block until all requests are sent, as it should; it returns early.

The expected behavior would be that `producer.flush()` blocks and eventually fails with a `TimeoutException` once `delivery.timeout.ms` is hit. The problem inside the producer seems to be that it initializes a flush by bumping `RecordAccumulator.flushesInProgress`. When the "message too large" error returns, it tries to split the batch and marks the original (too large) batch as "done", which decrements the `flushesInProgress` counter to zero. This leads to the early return of `flush()`, which believes the flush is completed. That should not happen, because we are still in the middle of the flush while new (smaller) batches get enqueued.

There seems to be some other issue in splitting batches though, as there are repeated log lines:

```
WARN [Producer clientId=app-shouldNotCommitOffsetsAndNotProduceOutputRecordsWhenProducerFailsWithMessageTooLargewWTlaRhOS_68DNGthXrNsQ-d363a0a7-08d4-4f0d-b5f1-3347745661be-StreamThread-1-producer] Got error produce response in correlation id 47 on topic-partition output-shouldNotCommitOffsetsAndNotProduceOutputRecordsWhenProducerFailsWithMessageTooLargewWTlaRhOS_68DNGthXrNsQ-0, splitting and retrying (2147483647 attempts left). Error: MESSAGE_TOO_LARGE (org.apache.kafka.clients.producer.internals.Sender:667)
```

It seems the producer is actually not able to split the batch, which I don't understand. Record size is small in the test setup, and batch size is set to 32MB... Also, the counter in `splitting and retrying (2147483647 attempts left)` does not decrease. I am not sure if this is right or wrong: it seems that when we split a batch and retry, the new batch(es) get a fresh retry count, which might be correct, but also somehow seems off.
This inability to split a batch is just a side issue, though, as there could also be the case of a single large message that cannot be split.

I am not a producer expert, and thus I am not 100% sure right now what the right fix should be. It seems to be a more difficult fix.

\cc @lianetm @kirktrue
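
To make the early return of `flush()` easier to discuss, here is a toy model of the behavior as I described it above. It is only a sketch and not the actual `KafkaProducer` code: `ToyAccumulator`, `ToyBatch`, and all method names are hypothetical stand-ins, and only the `flushesInProgress` name is taken from the real producer. It assumes that marking the split batch as "done" is what ends the flush.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class FlushSplitSketch {

    // Hypothetical stand-in for a producer batch; not Kafka's real batch class.
    static final class ToyBatch {
        final int records;
        boolean done;

        ToyBatch(int records) {
            this.records = records;
        }
    }

    // Hypothetical, single-threaded stand-in for the accumulator.
    static final class ToyAccumulator {
        final Deque<ToyBatch> queue = new ArrayDeque<>();
        int flushesInProgress = 0;

        // A flush starts by bumping the counter.
        void beginFlush() {
            flushesInProgress++;
        }

        // Assumed buggy path: the oversized batch is split into two smaller
        // batches that are enqueued again, but the original is marked done
        // and the flush counter is decremented anyway.
        void splitAndReenqueue(ToyBatch big) {
            queue.remove(big);
            int half = big.records / 2;
            queue.add(new ToyBatch(half));
            queue.add(new ToyBatch(big.records - half));
            big.done = true;
            flushesInProgress--;
        }

        // What a caller of flush() would observe in this model.
        boolean flushLooksComplete() {
            return flushesInProgress == 0;
        }

        boolean hasUnsentBatches() {
            return !queue.isEmpty();
        }
    }

    // Runs the scenario: one oversized batch, a flush, then a split.
    static ToyAccumulator simulate() {
        ToyAccumulator acc = new ToyAccumulator();
        ToyBatch big = new ToyBatch(10);
        acc.queue.add(big);
        acc.beginFlush();
        acc.splitAndReenqueue(big); // models the MESSAGE_TOO_LARGE response
        return acc;
    }

    public static void main(String[] args) {
        ToyAccumulator acc = simulate();
        System.out.println("flushLooksComplete=" + acc.flushLooksComplete()); // true
        System.out.println("hasUnsentBatches=" + acc.hasUnsentBatches());     // true
    }
}
```

In this model the flush appears complete while two smaller batches are still queued, which is the state in which Kafka Streams would go on to commit offsets. Whatever the real fix turns out to be, it would presumably need to keep the flush open until the re-enqueued batches are also done or have failed.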