[
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745734#comment-15745734
]
Michael Andre Pearce (IG) commented on KAFKA-4477:
--------------------------------------------------
Hi Jun,
The stack was taken by the automated restart script we've had to put in place;
it captured the stack before restarting the node, and it picked up the issue
about 20 seconds after the issue started.
The broker was not under high load during that period. We do not see any GC
issues, nor do we see any ZK issues.
The logs we are seeing match those of other people. We have had this occur
three more times, all with very similar logs, i.e. nothing new is showing up.
On a side note, we are looking to upgrade to 0.10.1.1 as soon as it's released
and we see it released by Confluent as well. We do this because we expect some
further sanity checks to have occurred by then, and we use it as a measure that
no critical issues remain.
We will aim to push it to UAT quickly (where we also see this issue; weirdly we
haven't had it occur in TEST or DEV) to see if the upgrade resolves it. What is
the expected timeline for this? Are we still expecting it to be released today?
And when would Confluent be likely to complete their testing and release?
Cheers
Mike
> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take
> leadership, cluster remains sick until node is restarted.
> --------------------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
> Reporter: Michael Andre Pearce (IG)
> Assignee: Apurva Mehta
> Priority: Critical
> Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log,
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log,
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has recurred in different
> physical environments. We haven't worked out what is going on. We do,
> though, have a nasty workaround to keep the service alive.
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it
> owns down to just itself; moments later we see other nodes having
> disconnects, followed finally by application issues, where producing to
> these partitions is blocked.
> It seems that only restarting the Kafka Java process resolves the issue.
> We have had this occur multiple times, and according to all network and
> machine monitoring the machine never left the network or had any other
> glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was
> read
> All clients:
> java.util.concurrent.ExecutionException:
> org.apache.kafka.common.errors.NetworkException: The server disconnected
> before a response was received.
> After this occurs, we then see a rapidly increasing number of CLOSE_WAIT
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are currently putting in place
> an automated process that tails the logs and matches the regex below; where
> new_partitions is just the node itself, we restart the node (a minimal
> sketch follows the regex).
> "\[(?P<time>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+)
> \(kafka.cluster.Partition\)"
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)