[ https://issues.apache.org/jira/browse/KAFKA-9895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307987#comment-17307987 ]

Dmitriy Poluyanov commented on KAFKA-9895:
------------------------------------------

Bumped into the same issue today: got an under-replicated partition 
{{__consumer_offsets-42}} during a plain rolling broker restart.

I've extracted some logs around the error message:
{code:java}
[2021-03-24 18:33:06,431] INFO [Log partition=__consumer_offsets-42, 
dir=/srv/disk03/kafka-logs] Truncating to offset 2841693383 (kafka.log.Log)
[2021-03-24 18:33:06,431] INFO [Log partition=__consumer_offsets-42, 
dir=/srv/disk03/kafka-logs] Scheduling segments for deletion List() 
(kafka.log.Log)
[2021-03-24 18:33:06,437] ERROR [ReplicaFetcher replicaId=8, leaderId=1, 
fetcherId=0] Unexpected error occurred during truncation for 
__consumer_offsets-42 at offset 2841693383 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.OffsetOutOfRangeException: Received request for 
offset 2841693384 for partition __consumer_offsets-42, but we only have log 
segments in the range 0 to 2841693383.
[2021-03-24 18:33:06,441] WARN [ReplicaFetcher replicaId=8, leaderId=1, 
fetcherId=0] Partition __consumer_offsets-42 marked as failed 
(kafka.server.ReplicaFetcherThread)
[2021-03-24 18:33:23,443] INFO [GroupMetadataManager brokerId=8] Scheduling 
unloading of offsets and group metadata from __consumer_offsets-42 
(kafka.coordinator.group.GroupMetadataManager)
[2021-03-24 18:33:23,444] INFO [GroupMetadataManager brokerId=8] Finished 
unloading __consumer_offsets-42. Removed 0 cached offsets and 0 cached groups. 
(kafka.coordinator.group.GroupMetadataManager)
[2021-03-24 18:34:06,432] INFO [Log partition=__consumer_offsets-42, 
dir=/srv/disk03/kafka-logs] Deleting segments List() (kafka.log.Log) {code}
 

So there is not only the OffsetOutOfRange error, but also the unloading of the 
partition and its detachment from the replication streams.
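For context, here is a minimal sketch (plain Java, my approximation of the broker's behavior rather than actual Kafka code) of the inclusive fetch-offset range check implied by the error message: after truncating to offset N, a fetch for any offset in [logStartOffset, N] is accepted, while the follower's request for N+1 is the one that gets rejected. The offsets are taken from the log excerpt above.

```java
// Sketch of the fetchable-offset range check suggested by the error:
// "we only have log segments in the range 0 to 2841693383", yet the
// follower asked for 2841693384. Not Kafka source code; an illustration.
public class FetchRangeSketch {
    static final long LOG_START_OFFSET = 0L;
    static final long LOG_END_OFFSET = 2841693383L; // the truncation target

    // A requested offset is fetchable only inside the inclusive range.
    static boolean inFetchableRange(long offset) {
        return offset >= LOG_START_OFFSET && offset <= LOG_END_OFFSET;
    }

    public static void main(String[] args) {
        System.out.println(inFetchableRange(2841693383L)); // truncation point: true
        System.out.println(inFetchableRange(2841693384L)); // rejected fetch: false
    }
}
```

The one-past-the-end request is exactly what both log excerpts show, which is why this looks like an off-by-one between the truncation offset and the next fetch offset.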

We are on Kafka 2.4.1 now, so the problem is present at least in kafka-2.4.1.

After a second restart we got stuck with {{__consumer_offsets-33}} showing the 
same problem.
{code:java}
[2021-03-24 19:19:51,981] INFO [Log partition=__consumer_offsets-33, 
dir=/srv/disk04/kafka-logs] Truncating to offset 2002630095 (kafka.log.Log)
[2021-03-24 19:19:51,981] INFO [Log partition=__consumer_offsets-33, 
dir=/srv/disk04/kafka-logs] Scheduling segments for deletion List() 
(kafka.log.Log)
[2021-03-24 19:19:51,982] ERROR [ReplicaFetcher replicaId=8, leaderId=3, 
fetcherId=0] Unexpected error occurred during truncation for 
__consumer_offsets-33 at offset 2002630095 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.OffsetOutOfRangeException: Received request for 
offset 2002630096 for partition __consumer_offsets-33, but we only have log 
segments in the range 0 to 2002630095.
[2021-03-24 19:19:51,986] WARN [ReplicaFetcher replicaId=8, leaderId=3, 
fetcherId=0] Partition __consumer_offsets-33 marked as failed 
(kafka.server.ReplicaFetcherThread) {code}
The broker itself works normally; it just stops receiving any updates for the 
problem partition.
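The observable symptom is the under-replicated partition itself: once the fetcher marks the partition as failed, this broker drops out of that partition's ISR. A stdlib-only sketch of the check involved (a partition is under-replicated when its in-sync replica set is strictly smaller than its assigned replica set; the broker ids and assignments below are illustrative, not taken from this report):

```java
// Hedged sketch of the under-replicated condition: ISR smaller than the
// replica set. Broker ids are made up for illustration.
import java.util.List;
import java.util.Map;

public class UnderReplicatedSketch {
    static boolean isUnderReplicated(List<Integer> replicas, List<Integer> isr) {
        return isr.size() < replicas.size();
    }

    public static void main(String[] args) {
        // Hypothetical assignments: partition 42 has lost one replica from
        // its ISR (as when a fetcher is marked failed); partition 0 is healthy.
        Map<Integer, List<Integer>> replicas = Map.of(
                42, List.of(1, 8, 3),
                0, List.of(2, 4, 5));
        Map<Integer, List<Integer>> isr = Map.of(
                42, List.of(1, 3),
                0, List.of(2, 4, 5));
        for (int p : replicas.keySet()) {
            if (isUnderReplicated(replicas.get(p), isr.get(p))) {
                System.out.println("__consumer_offsets-" + p + " is under-replicated");
            }
        }
    }
}
```

In practice the same comparison is what `kafka-topics.sh --describe --under-replicated-partitions` surfaces, which is how the stuck partition shows up during the rolling restart.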

> Truncation request on broker start up may cause OffsetOutOfRangeException
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-9895
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9895
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Boquan Tang
>            Priority: Major
>
> We have a 4 broker cluster running version 2.4.0.
> Upon broker restart, we frequently observe issue like this:
> {code}
> [2020-04-20 20:36:37,827] ERROR [ReplicaFetcher replicaId=4, leaderId=1, 
> fetcherId=0] Unexpected error occurred during truncation for topic-name-10 at 
> offset 632111354 (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.common.errors.OffsetOutOfRangeException: Received request 
> for offset 632111355 for partition active-ads-10, but we only have log 
> segments in the range 0 to 632111354.
> {code}
> The partition experiencing this issue seems random. Could we actually ignore 
> this kind of error and not take the partition offline? From what the error 
> log describes, I think once the start-up finishes and the partition catches 
> up with the leader, it should be OK to put it back into the ISR. Please 
> correct me if I'm understanding this incorrectly.
> This happens after we updated to 2.4.0, so I'm wondering if it has anything 
> to do with this specific version or not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
