[ https://issues.apache.org/jira/browse/KAFKA-13177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395873#comment-17395873 ]

kaushik srinivas commented on KAFKA-13177:
------------------------------------------

This is expected behavior. It happens when the leader epoch changes, either 
because brokers or controllers are bounced. Replica fetcher threads are 
removed and re-added whenever the leader epoch changes.

These log messages disappeared once the brokers were back up and running. 
They are expected whenever brokers restart or a partition reassignment takes 
place.
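
To confirm that the fetchers recover after a restart, per-partition ISR 
membership can be watched with the AdminClient. A minimal sketch (the 
bootstrap address is a placeholder; "test" is the topic from the trace 
below):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class IsrCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address; point this at any broker in the cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin
                    .describeTopics(Collections.singleton("test"))
                    .all().get().get("test");
            for (TopicPartitionInfo p : desc.partitions()) {
                // Once the brokers settle, isr should again list all replicas.
                System.out.printf("partition %d: leader=%s isr=%s%n",
                        p.partition(), p.leader(), p.isr());
            }
        }
    }
}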

> partition failures and fewer ISR shrinks but many ISR expansions with 
> increased num.replica.fetchers in Kafka brokers
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-13177
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13177
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: kaushik srinivas
>            Assignee: kaushik srinivas
>            Priority: Major
>
> Setup: 3-node Kafka broker cluster (4 CPU cores and 4Gi memory on k8s).
> Topics: 15; partitions per topic: 15; replication factor: 3;
> min.insync.replicas: 2.
> Producers run with acks=all.
> Initially num.replica.fetchers was set to 1 (the default) and we observed 
> very frequent ISR shrinks and expansions, so the setting was raised to 4.
> After this change, we see the following behavior and warning messages in 
> the broker logs:
>  # Over a period of 2 days, there are around 10 shrinks corresponding to 10 
> partitions, but around 700 ISR expansions covering almost all partitions in 
> the cluster (approx. 50 to 60 partitions).
>  # We see frequent WARN messages about partitions being marked as failed in 
> the same time span. Below is the trace: {"type":"log", "host":"wwwwww", 
> "level":"WARN", "neid":"kafka-wwwwww", "system":"kafka", 
> "time":"2021-08-03T20:09:15.340", "timezone":"UTC", 
> "log":{"message":"ReplicaFetcherThread-2-1003 - 
> kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1001, 
> leaderId=1003, fetcherId=2] Partition test-16 marked as failed"}}
>  
> We see the above behavior continuously after increasing 
> num.replica.fetchers from 1 to 4. We increased it to improve replication 
> performance and thereby reduce ISR shrinks, but we see this strange 
> behavior after the change. What does the above trace indicate? Is marking 
> partitions as failed just a WARN message that Kafka handles internally, or 
> is it something to worry about?
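
For reference, the broker settings described in the report above correspond 
to server.properties entries like these (a sketch of the reporter's stated 
values, not a recommendation):

# num.replica.fetchers was raised from the default of 1
num.replica.fetchers=4
# used together with replication factor 3 and acks=all producers
min.insync.replicas=2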



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
