dajac opened a new pull request, #22466:
URL: https://github.com/apache/kafka/pull/22466
During routine `__consumer_offsets` partition leadership changes, the
group coordinator spams ERROR-level logs for every in-flight write at
the moment of transition:
[GroupCoordinator id=N] Writing records to __consumer_offsets-N failed
due to: For requests intended only for the leader, this error indicates that
the broker is not the current leader ...
[GroupCoordinator id=N] Execution of FlushBatch failed due to For
requests intended only for the leader, this error indicates that the broker is
not the current leader ...
These appear on the group coordinator that lost leadership and persist
until the queue of in-flight batches has drained. The behavior is correct:
`NotLeaderOrFollowerException` propagates through `failCurrentBatch` to
the deferred events and is mapped to `NOT_COORDINATOR` for clients via
`CoordinatorOperationExceptionHelper`, so clients retry against the new
coordinator. This is purely a logging-noise issue.
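As a rough illustration of the mapping described above, the sketch below translates a leadership-loss failure into the error a client would see. It is not the code in `CoordinatorOperationExceptionHelper`: the class `CoordinatorErrorMappingSketch`, the enum `ClientError`, and the locally declared exception are hypothetical stand-ins so the example compiles without Kafka on the classpath.

```java
// Sketch only: every type here is a local stand-in, not a Kafka class.
final class CoordinatorErrorMappingSketch {
    // Stand-in for Kafka's NotLeaderOrFollowerException.
    static class NotLeaderOrFollowerException extends RuntimeException {}

    // Hypothetical subset of client-facing error codes.
    enum ClientError { NOT_COORDINATOR, UNKNOWN_SERVER_ERROR }

    // Translate a write failure into the error returned to the client.
    static ClientError toClientError(Throwable failure) {
        if (failure instanceof NotLeaderOrFollowerException) {
            // Losing partition leadership means this broker is no longer
            // the coordinator, so the client should look up the new one.
            return ClientError.NOT_COORDINATOR;
        }
        // Assumption for this sketch: everything else is an unknown error.
        return ClientError.UNKNOWN_SERVER_ERROR;
    }
}
```

Because the client receives `NOT_COORDINATOR`, which it already handles by rediscovering the coordinator and retrying, the failure needs no operator attention.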
This patch classifies the flush failures in `flushCurrentBatch`:
- `NotLeaderOrFollowerException` (leadership change) and
`InvalidProducerEpochException` (fenced producer) are logged at DEBUG;
- `RecordTooLargeException` and `RecordBatchTooLargeException` remain at
ERROR with added context (record count and maximum batch size);
- `NotCoordinatorException`, thrown when the state machine is out of
sync, is not re-logged since that condition is already logged;
- any other failure is logged at ERROR as an unexpected error.
The redundant re-log from the `FlushBatch` event is also suppressed so
that a failure is logged once.
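The classification above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the actual change to `flushCurrentBatch`: the class `FlushFailureLoggingSketch`, the `Level` enum, the methods `levelFor` and `messageFor`, and the message wording are all hypothetical, and the exception types are declared locally as stand-ins for the Kafka classes of the same names.

```java
import java.util.Optional;

// Sketch only: local stand-ins replace the Kafka exception classes.
final class FlushFailureLoggingSketch {
    static class NotLeaderOrFollowerException extends RuntimeException {}
    static class InvalidProducerEpochException extends RuntimeException {}
    static class RecordTooLargeException extends RuntimeException {}
    static class RecordBatchTooLargeException extends RuntimeException {}
    static class NotCoordinatorException extends RuntimeException {}

    enum Level { DEBUG, ERROR }

    // Returns the level to log at, or empty when the failure was already
    // logged elsewhere and should not be logged again.
    static Optional<Level> levelFor(Throwable failure) {
        if (failure instanceof NotLeaderOrFollowerException
                || failure instanceof InvalidProducerEpochException) {
            return Optional.of(Level.DEBUG); // expected during normal operation
        }
        if (failure instanceof NotCoordinatorException) {
            return Optional.empty(); // already logged where it was detected
        }
        // Size violations and anything unexpected stay at ERROR.
        return Optional.of(Level.ERROR);
    }

    // Builds the log message, adding size context for size violations.
    static String messageFor(Throwable failure, int recordCount, int maxBatchSize) {
        String base = "Flush failed due to " + failure.getClass().getSimpleName();
        if (failure instanceof RecordTooLargeException
                || failure instanceof RecordBatchTooLargeException) {
            return base + " (records=" + recordCount
                + ", maxBatchSize=" + maxBatchSize + ")";
        }
        return base;
    }
}
```

Returning an empty `Optional` rather than a third level keeps the "do not log again" case explicit at the call site, which is one way to guarantee that each failure is logged at most once.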