[
https://issues.apache.org/jira/browse/KAFKA-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15745734#comment-15745734
]
Michael Andre Pearce (IG) commented on KAFKA-4477:
--------------------------------------------------
Hi Jun,
The stack was taken by the automated restart script we've had to put in place;
it captured the stack before restarting the node, and it picked up the issue
about 20 seconds after the issue started.
The broker was not under high load during that period. We do not see any GC
issues, nor do we see any ZK issues.
The logs we are seeing match those of other people. We have had this occur
three more times, all with very similar logs, i.e. nothing new is showing up.
On a side note, we are looking to upgrade to 0.10.1.1 as soon as it's released
and we see it released by Confluent as well. We do this because we expect some
further sanity checks to have occurred by then, and we use it as a measure that
no critical issues remain.
We will aim to push it to UAT quickly (where we also see this issue; weirdly we
haven't had it occur in TEST or DEV) to see if the upgrade resolves it. What is
the expected timeline for this? Are we still expecting it to be released today?
And when would Confluent be likely to complete their testing and release?
Cheers
Mike
> Node reduces its ISR to itself, and doesn't recover. Other nodes do not take
> leadership, cluster remains sick until node is restarted.
> --------------------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-4477
> URL: https://issues.apache.org/jira/browse/KAFKA-4477
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.10.1.0
> Environment: RHEL7
> java version "1.8.0_66"
> Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
> Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
> Reporter: Michael Andre Pearce (IG)
> Assignee: Apurva Mehta
> Priority: Critical
> Labels: reliability
> Attachments: issue_node_1001.log, issue_node_1001_ext.log,
> issue_node_1002.log, issue_node_1002_ext.log, issue_node_1003.log,
> issue_node_1003_ext.log, kafka.jstack
>
>
> We have encountered a critical issue that has recurred in different
> physical environments. We haven't worked out what is going on. We do,
> though, have a nasty workaround to keep the service alive.
> We have not had this issue on clusters still running 0.9.0.1.
> We have noticed a node randomly shrinking the ISRs for the partitions it
> owns down to just itself; moments later we see other nodes having
> disconnects, followed finally by application issues, where producing to
> these partitions is blocked.
> It seems that only restarting the Kafka Java process resolves the issue.
> We have had this occur multiple times, and according to all network and
> machine monitoring the machine never left the network or had any other
> glitches.
> Below are logs seen during the issue.
> Node 7:
> [2016-12-01 07:01:28,112] INFO Partition
> [com_ig_trade_v1_position_event--demo--compacted,10] on broker 7: Shrinking
> ISR for partition [com_ig_trade_v1_position_event--demo--compacted,10] from
> 1,2,7 to 7 (kafka.cluster.Partition)
> All other nodes:
> [2016-12-01 07:01:38,172] WARN [ReplicaFetcherThread-0-7], Error in fetch
> kafka.server.ReplicaFetcherThread$FetchRequest@5aae6d42
> (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 7 was disconnected before the response was
> read
> All clients:
> java.util.concurrent.ExecutionException:
> org.apache.kafka.common.errors.NetworkException: The server disconnected
> before a response was received.
> After this occurs, we then see a rapidly increasing number of CLOSE_WAIT
> connections and open file descriptors on the sick machine.
> As a workaround to keep the service alive, we are currently putting in place
> an automated process that tails the logs and matches the regex below; where
> new_partitions is just the node itself, we restart the node (a minimal
> sketch follows the regex).
> "\[(?P<time>.+)\] INFO Partition \[.*\] on broker .* Shrinking ISR for
> partition \[.*\] from (?P<old_partitions>.+) to (?P<new_partitions>.+)
> \(kafka.cluster.Partition\)"
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)