mjsax commented on PR #20254:
URL: https://github.com/apache/kafka/pull/20254#issuecomment-3134139853

   Thanks for the PR. I checked it out and ran it locally, and I believe this 
is actually a bug inside `KafkaProducer`.
   
   Before Kafka Streams commits offsets, it calls `producer.flush()`, but this 
call returns early instead of blocking until all requests are sent. The 
expected behavior is that `producer.flush()` blocks and eventually fails with 
a `TimeoutException` once `delivery.timeout.ms` is hit.
   
   The problem inside the producer seems to be the following: it starts a flush 
by bumping `RecordAccumulator.flushesInProgress`. When the "message too large" 
error is returned, it tries to split the batch and marks the original (too 
large) batch as "done", which decrements the `flushesInProgress` counter to 
zero. This leads to the early return of `flush()`, which believes the flush is 
complete. That should not happen, as we are still in the middle of the flush 
while the new (smaller) batches get enqueued.
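
   The ordering problem described above can be illustrated with a toy model. 
This is only a sketch of the suspected sequence, not the actual 
`RecordAccumulator` code: `ToyAccumulator` and all of its methods are 
hypothetical names, and the real producer is multi-threaded, while this model 
only records what a waiting `flush()` could observe at the critical moment.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: ToyAccumulator and its methods are hypothetical and do not
// mirror Kafka's real RecordAccumulator.
class FlushEarlyReturnSketch {

    static final class ToyAccumulator {
        // Batches that a flush still has to wait for.
        private final Deque<String> queued = new ArrayDeque<>();

        void enqueue(String batch) { queued.add(batch); }

        void markDone(String batch) { queued.remove(batch); }

        // What a waiting flush() would conclude if it checked right now.
        boolean flushLooksComplete() { return queued.isEmpty(); }

        // Suspected order: the oversized batch is marked done before the
        // split batches are enqueued. Returns what flush() could observe
        // in the gap between the two steps.
        boolean splitDoneFirst(String batch) {
            markDone(batch);
            boolean seenByFlush = flushLooksComplete();
            enqueue(batch + "-a");
            enqueue(batch + "-b");
            return seenByFlush;
        }

        // Alternative order: the split batches are enqueued first, so there
        // is never a moment with nothing outstanding.
        boolean splitEnqueueFirst(String batch) {
            enqueue(batch + "-a");
            enqueue(batch + "-b");
            markDone(batch);
            return flushLooksComplete();
        }
    }

    static boolean earlyReturnPossible(boolean doneFirst) {
        ToyAccumulator acc = new ToyAccumulator();
        acc.enqueue("big");
        return doneFirst ? acc.splitDoneFirst("big") : acc.splitEnqueueFirst("big");
    }

    public static void main(String[] args) {
        System.out.println("done first:    early return possible = " + earlyReturnPossible(true));
        System.out.println("enqueue first: early return possible = " + earlyReturnPossible(false));
    }
}
```

   In this model, only the "done first" order leaves a window in which the 
flush appears complete although split batches are about to be enqueued. 
Whether reordering alone would be a sufficient fix in the real producer is an 
open question.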
   
   There seems to be another issue with splitting batches, though, as the 
same warning is logged repeatedly:
   ```
   WARN [Producer 
clientId=app-shouldNotCommitOffsetsAndNotProduceOutputRecordsWhenProducerFailsWithMessageTooLargewWTlaRhOS_68DNGthXrNsQ-d363a0a7-08d4-4f0d-b5f1-3347745661be-StreamThread-1-producer]
 Got error produce response in correlation id 47 on topic-partition 
output-shouldNotCommitOffsetsAndNotProduceOutputRecordsWhenProducerFailsWithMessageTooLargewWTlaRhOS_68DNGthXrNsQ-0,
 splitting and retrying (2147483647 attempts left). Error: MESSAGE_TOO_LARGE 
(org.apache.kafka.clients.producer.internals.Sender:667)
   ```
   It seems the producer is not actually able to split the batch, which I don't 
understand. Record size is small in the test setup, and batch size is set to 
32MB... Also, the counter in `splitting and retrying (2147483647 attempts left)` 
does not decrease. I am not sure if this is right or wrong: it seems that when 
we split a batch and retry, the new batch(es) get a fresh retry count, which 
might be correct, but also seems somehow off. This inability to split a batch 
is just a side issue, though, as there could also be the case of a single large 
message that cannot be split.
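
   To make the "fresh retry count" hypothesis concrete, here is a small toy 
model. It is not the real `Sender` or batch code: `ToyBatch`, `split()`, 
`retried()` and `attemptsLeft()` are hypothetical names, and the rule that 
split batches start with zero attempts is exactly the assumption being 
illustrated. The value 2147483647 is `Integer.MAX_VALUE`, matching the log 
line.

```java
import java.util.List;

// Toy model only: all names here are hypothetical, not Kafka internals.
class RetryCountSketch {

    // Retry budget; equals the 2147483647 shown in the log line.
    static final int RETRIES = Integer.MAX_VALUE;

    static final class ToyBatch {
        final int attempts;

        ToyBatch(int attempts) { this.attempts = attempts; }

        // Assumption under test: batches created by a split start over
        // with zero attempts instead of inheriting the parent's count.
        List<ToyBatch> split() {
            return List.of(new ToyBatch(0), new ToyBatch(0));
        }

        // A plain retry of the same batch consumes one attempt.
        ToyBatch retried() { return new ToyBatch(attempts + 1); }
    }

    static int attemptsLeft(ToyBatch batch) { return RETRIES - batch.attempts; }

    public static void main(String[] args) {
        ToyBatch original = new ToyBatch(0);
        System.out.println("original:      " + attemptsLeft(original));
        for (ToyBatch child : original.split()) {
            System.out.println("after split:   " + attemptsLeft(child));
        }
        System.out.println("plain retry:   " + attemptsLeft(original.retried()));
    }
}
```

   Under this assumption, a batch that is split again and again never reports 
fewer attempts left, which would be consistent with the repeated log line. A 
plain retry, by contrast, would show the count going down.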
   
   I am not a producer expert, and thus I am not 100% sure right now what the 
right fix should be. It seems to be a more difficult fix. \cc @lianetm 
@kirktrue 

