[
https://issues.apache.org/jira/browse/KAFKA-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ismael Juma updated KAFKA-2827:
-------------------------------
Affects Version/s: (was: 0.9.0.1)
0.9.0.0
> Producer metadata stall
> -----------------------
>
> Key: KAFKA-2827
> URL: https://issues.apache.org/jira/browse/KAFKA-2827
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.9.0.0
> Reporter: Geoff Anderson
>
> This is a tracking issue for a suspicious behavior that has been plaguing the
> stability of certain system tests.
> Producer metadata requests and produce requests can be blocked for quite a
> while (tens of seconds to minutes) after a hard bounce of a broker, even
> though new leaders for the topicPartitions on the bounced broker have already
> been re-elected.
> This shows up in particular during ducktape replication tests where the
> leader of a topic is bounced with a hard kill. The initial symptom was the
> console consumer timing out due to no messages being available for
> consumption during this lengthy producer "stall" period.
> This behavior does not appear to be present in Kafka 0.8.2: 15 of 15 runs of
> the test in question passed with 0.8.2 substituted for trunk.
> [~benstopford]'s findings so far:
> Theory:
> The first node, Node1, dies. A new controller is elected and it correctly
> reassigns the partition leaders to the other two nodes (Node2, Node3).
> However, for some reason, Node1 remains in the ISR list for two of the three
> partitions:
> Broker 3 cached leader info (Leader:3,ISR:3,2, AllReplicas:1,3,2) for partition test_topic,0
> Broker 3 cached leader info (Leader:3,ISR:3,2,1, AllReplicas:3,2,1) for partition test_topic,2
> Broker 3 cached leader info (Leader:2,ISR:2,1,3, AllReplicas:2,1,3) for partition test_topic,1
> (a little later broker 1 is dropped from the ISR for partition 2; partition 3
> keeps all three replicas throughout)
> Meanwhile the producer is sending requests with acks=-1. These block on the
> server because the ISRs for partitions 1 and 2 still contain the dead Node1.
> As an aside, metadata requests from the producer are blocked too during this
> period (they are visible in the producer’s request queue).
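> For reference, a minimal sketch of the kind of producer these tests are assumed
> to run (new Java producer with acks=-1); hostnames, topic and settings here are
> illustrative, not the actual test configuration:
> {code:java}
> import java.util.Properties;
>
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.clients.producer.ProducerRecord;
>
> public class AcksAllProducerSketch {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         // Illustrative bootstrap list; the test uses its own 3-node cluster.
>         props.put("bootstrap.servers", "node1:9092,node2:9092,node3:9092");
>         // acks=-1 (a.k.a. "all"): the leader holds the request until every
>         // replica currently in the ISR has the message, so a dead broker that
>         // is still listed in the ISR stalls the produce request.
>         props.put("acks", "-1");
>         props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
>         props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
>
>         KafkaProducer<String, String> producer = new KafkaProducer<>(props);
>         for (int i = 0; i < 1000; i++) {
>             // Block on each ack so the stall shows up as a send() that hangs.
>             producer.send(new ProducerRecord<>("test_topic", Integer.toString(i), "msg-" + i)).get();
>         }
>         producer.close();
>     }
> }
> {code}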
> The consumer times out as it hasn’t consumed any data for its timeout period
> (a timeout in the console_consumer). This is because no data is produced
> during this period, as the producer is blocked. (The producer is usually
> blocked for around 15 seconds, but this varies from run to run.)
> When Node1 is restarted, everything clears, as the fetchers on Node1 can
> replicate data again. The producer is updated with metadata and continues
> producing. All is back to normal.
> Then the next node goes down and the whole cycle repeats.
> Unanswered questions:
> I don’t know why we end up in a situation where the ISR is empty (isr=[]).
> What I do know is that this always coincides with the partition being in a
> LEADER_NOT_AVAILABLE (partition_error_code=5) or REPLICA_NOT_AVAILABLE
> (partition_error_code=9) error state.
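> (For reference, those partition_error_code values can be mapped back to names
> with the Java client's Errors enum; a minimal sketch, assuming
> org.apache.kafka.common.protocol.Errors is on the classpath:)
> {code:java}
> import org.apache.kafka.common.protocol.Errors;
>
> public class ErrorCodeLookup {
>     public static void main(String[] args) {
>         // Decode the partition_error_code values seen in the metadata responses.
>         System.out.println(Errors.forCode((short) 5)); // LEADER_NOT_AVAILABLE
>         System.out.println(Errors.forCode((short) 9)); // REPLICA_NOT_AVAILABLE
>     }
> }
> {code}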
> To Do:
> Dig into why the dead node is not removed from the ISR.
> To get the test running for now, one option would be to wait for the ISR to
> return to isr=1,2,3 after each node is restarted (e.g. by polling partition
> metadata, as sketched below).
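> A rough sketch of such a wait, assuming we poll partition metadata through the
> Java client's partitionsFor()/PartitionInfo API (topic name, replica count and
> timeouts are illustrative):
> {code:java}
> import java.util.List;
>
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.common.PartitionInfo;
>
> public class IsrWaitSketch {
>     // Poll partition metadata until every partition of the topic reports the
>     // expected number of in-sync replicas, or give up after timeoutMs.
>     static void waitForFullIsr(KafkaProducer<String, String> producer, String topic,
>                                int expectedIsrSize, long timeoutMs) throws InterruptedException {
>         long deadline = System.currentTimeMillis() + timeoutMs;
>         while (System.currentTimeMillis() < deadline) {
>             List<PartitionInfo> partitions = producer.partitionsFor(topic);
>             boolean allInSync = true;
>             for (PartitionInfo p : partitions) {
>                 if (p.inSyncReplicas().length < expectedIsrSize) {
>                     allInSync = false;
>                     break;
>                 }
>             }
>             if (allInSync) {
>                 return; // ISR is back to full strength; safe to bounce the next node
>             }
>             Thread.sleep(500);
>         }
>         throw new IllegalStateException("ISR did not recover within " + timeoutMs + " ms");
>     }
> }
> {code}
> Whether the client-side metadata reflects ISR changes quickly enough for this
> to be reliable is untested; checking ZK or the kafka-topics tool from the test
> driver would be an alternative.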
> Sample Timeline
> 18:45:59,606 - consumer gets first disconnection error (n1 is dead)
> 18:45:59,500 - producer gets first disconnection error (n1 is dead)
> 18:46:06,001 - ZK times out node1
> 18:46:06,321 - Broker 3 (Controller) selects a new leader and ISR for
> test_topic,0: {"leader":3,"leader_epoch":1,"isr":[3,2]}
> => so it moves the leader but it doesn’t change the ISR on the other two
> partitions
> 18:46:06,168 - Broker 2 is updated by Broker 3 with the new ISR (the leaders
> exclude n1, BUT the ISR still names all 3?)
> 18:46:09,582 - Consumer gives up at 10s timeout
> 18:46:14,866 - Controller is notified by ZK that node 1 is back.
> 18:46:15,845 - metadata update hits the producer (my guess is this is simply
> because node 1 is now available, so replication can continue and release the
> held-up produce requests).
> 18:46:16,922 - producer disconnects from node 3 (node 3 dead)
> If we need a workaround for the tests, we can simply set
> replica.lag.time.max.ms to a lower value, like 1000ms, so that a dead replica
> is dropped from the ISR sooner (so far untested).
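> For example, in the brokers' server.properties (value illustrative, per the
> untested suggestion above):
> {code:title=server.properties}
> # Drop a replica that has stopped fetching from the ISR sooner than the default.
> replica.lag.time.max.ms=1000
> {code}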
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)