[jira] [Comment Edited] (KAFKA-4477) Node reduces its ISR to itself, and doesn't recover. Other nodes do not take leadership, cluster remains sick until node is restarted.

Michael Andre Pearce (IG) (JIRA) Tue, 13 Dec 2016 20:56:35 -0800

    [ 
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747271#comment-15747271
 ]


Michael Andre Pearce (IG) edited comment on KAFKA-4477 at 12/14/16 4:55 AM:
----------------------------------------------------------------------------

Hi [~apurva],

Whilst I await the issue occurring again so that I can provide some further 
logs for you, I have a query after reading the above comment.

By the sounds of it there is a possible deadlock preventing the ISR from 
re-expanding (though some of the stacks we have captured don't show this). The 
question, though, is why the ISRs are shrinking in the first place.

Re the 0.10.1.1 RC: unfortunately, the environments we see the issue in are 
UAT and PROD, so we will only be able to deploy it once 0.10.1.1 is 
GA/tagged.

On another note, according to others 0.10.0.0 doesn't seem to contain this 
issue (we can only confirm that 0.9.0.1 doesn't). Is there any possible way to 
downgrade from 0.10.1.0 to 0.10.0.0, and is there a doc for this? Obviously all 
the docs are for upgrade paths, not downgrades.

Cheers
Mike


> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take 
> leadership, cluster remains sick until node is restarted.
> --------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4477
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4477
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0
>         Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
>            Reporter: Michael Andre Pearce (IG)
>            Assignee: Apurva Mehta
>            Priority: Critical
>              Labels: reliability
>         Attachments: issue_node_1001.log, issue_node_1001_ext.log, 
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log, 
> issue_node_1003_ext.log, kafka.jstack, state_change_controller.tar.gz
>
>
> We have encountered a critical issue that has recurred in different 
> physical environments. We haven't worked out what is going on, though we do 
> have a nasty workaround to keep the service alive. 
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it 
> owns down to itself. Moments later we see other nodes having disconnects, 
> followed finally by app issues, where producing to these partitions is 
> blocked.
> It seems that only restarting the Kafka instance's Java process resolves the 
> issue.
> We have had this occur multiple times, and according to all network and 
> machine monitoring the machine never left the network or had any other 
> glitches.
> The logs seen during the issue are below.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition 
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking 
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from 
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch 
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42 
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was 
> read
> All clients:
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.NetworkException: The server disconnected 
> before a response was received.
> After this occurs, we then suddenly see an increasing number of CLOSE_WAIT 
> connections and file descriptors on the sick machine.
> As a workaround to keep the service up, we are currently putting in an 
> automated process that tails the log and matches the regex below; where 
> new_partitions contains just the node itself, we restart the node. 
> "\[(?P<time>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for 
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+) 
> \(kafka.cluster.Partition\)"
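
The check described in the workaround could look roughly like the sketch below. It is only an illustration, not the reporter's actual script: the `broker` capture group and the function name `isr_shrunk_to_self` are assumptions added here so that the new ISR can be compared with the broker's own id, and the restart action itself is left out because it is site-specific.

```python
import re

# Pattern adapted from the issue description. The "broker" group is an
# assumption added for this sketch; the original regex matched the broker id
# with ".*" and did not capture it.
SHRINK_RE = re.compile(
    r"\[(?P<time>.+?)\] INFO Partition \[.*\] on broker (?P<broker>\d+): "
    r"Shrinking ISR for partition \[.*\] from (?P<old_partitions>.+) "
    r"to (?P<new_partitions>.+) \(kafka\.cluster\.Partition\)"
)


def isr_shrunk_to_self(line):
    """Return True if the log line reports an ISR shrinking to only the
    broker that wrote the line."""
    match = SHRINK_RE.search(line)
    if match is None:
        return False
    new_isr = [r.strip() for r in match.group("new_partitions").split(",")]
    return new_isr == [match.group("broker")]


# The "Node 7" log line from the issue description, joined onto one line.
sample = (
    "[2016-12-01 07:01:28,112] INFO Partition "
    "[com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: "
    "Shrinking ISR for partition "
    "[com_ig_trade_v1_position_event--demo--compacted,10] from 1,2,7 to 7 "
    "(kafka.cluster.Partition)"
)

if isr_shrunk_to_self(sample):
    print("ISR shrunk to the broker itself; a restart would be triggered here")
```

A tailing process would call `isr_shrunk_to_self` on each new log line and trigger the restart when it returns True. A shrink that still leaves other replicas in the ISR (for example from 1,2,7 to 2,7) is ignored.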



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
