[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change
[ https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182552#comment-15182552 ]

ASF GitHub Bot commented on KAFKA-2960:
---------------------------------------

GitHub user becketqin opened a pull request:

    https://github.com/apache/kafka/pull/1018

    KAFKA-2960: Clear purgatory for partitions before becoming follower

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/becketqin/kafka KAFKA-2960

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/kafka/pull/1018.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1018

commit 6ee590bc8f65217227c8bda98644dce35ed0d701
Author: Jiangjie Qin
Date:   2016-03-07T04:04:45Z

    KAFKA-2960: Clear purgatory for partition before becoming follower

> DelayedProduce may cause message lose during repeatly leader change
> -------------------------------------------------------------------
>
> Key: KAFKA-2960
> URL: https://issues.apache.org/jira/browse/KAFKA-2960
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.9.0.0
> Reporter: Xing Huang
> Assignee: Jiangjie Qin
> Fix For: 0.10.0.0
>
> related to #KAFKA-1148
> When a leader replica becomes a follower and then leader again, it may
> truncate its log while it is a follower. The second time it becomes leader,
> its ISR may shrink, and if new messages are appended at this moment, the
> DelayedProduce generated when it was leader the first time may be satisfied,
> and the client will receive a response with no error. But the messages were
> actually lost.
> We simulated this scenario, which proved that the message loss can happen.
> According to broker logs and client logs, it seems to be the reason for a
> data loss that recently happened to us.
> I think we should check the leader epoch when sending a response, or satisfy
> the DelayedProduce when the leader changes, as described in #KAFKA-1148.
> And we may need a new error code to inform the producer about this error.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
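One of the two options raised in the description is to check the leader epoch before sending a response. A minimal sketch of that idea follows, written in Python rather than taken from the broker code. PendingProduce, complete() and the STALE_LEADER_EPOCH result are hypothetical names used only for illustration; they are not Kafka APIs or Kafka error codes.

```python
from dataclasses import dataclass


@dataclass
class PendingProduce:
    """Hypothetical stand-in for a delayed produce request."""
    required_offset: int
    leader_epoch: int  # epoch under which the message was appended


def complete(pending, high_watermark, current_leader_epoch):
    """Return "WAIT", "OK", or the hypothetical "STALE_LEADER_EPOCH"."""
    # A request appended under an older leadership term must not be
    # acknowledged: the log may have been truncated in between.
    if pending.leader_epoch != current_leader_epoch:
        return "STALE_LEADER_EPOCH"
    if high_watermark >= pending.required_offset:
        return "OK"
    return "WAIT"
```

Under this assumption, a request appended in epoch 5 and examined in epoch 7 is rejected even if the high watermark has reached its offset. The pull request quoted above takes the other route and clears the purgatory before the broker becomes a follower.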
[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change
[ https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049028#comment-15049028 ]

Jiangjie Qin commented on KAFKA-2960:
-------------------------------------

[~guozhang] Got it. Thanks for the explanation.

> DelayedProduce may cause message lose during repeatly leader change
> -------------------------------------------------------------------
>
> Key: KAFKA-2960
> URL: https://issues.apache.org/jira/browse/KAFKA-2960
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.9.0.0
> Reporter: Xing Huang
> Fix For: 0.9.1.0
[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change
[ https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047198#comment-15047198 ]

Jiangjie Qin commented on KAFKA-2960:
-------------------------------------

[~iBuddha] In Kafka, persistence guarantees are offered at different levels. Would the following settings solve the scenario you mentioned?

acks=-1
min.isr=2
replication factor > 2

This should guarantee that, when the response is sent, at least two brokers in the ISR have persisted the messages. So there should be no message loss unless the entire cluster is down.
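As a rough illustration of what these settings are meant to enforce at append time, here is a small Python sketch. It assumes that "min.isr" in the comment is shorthand for the min.insync.replicas setting; leader_accepts_append() is a hypothetical helper written for this example, not a Kafka function.

```python
# Values from the comment above.
ACKS = -1                  # producer waits for the full ISR
MIN_INSYNC_REPLICAS = 2    # assumed meaning of "min.isr"
REPLICATION_FACTOR = 3     # any value > 2 satisfies the comment


def leader_accepts_append(acks, isr_size, min_insync_replicas):
    """Hypothetical model: with acks=-1 the leader refuses writes
    while the ISR is smaller than min.insync.replicas."""
    if acks == -1 and isr_size < min_insync_replicas:
        return False
    return True
```

With these values, leader_accepts_append(ACKS, 1, MIN_INSYNC_REPLICAS) is False and leader_accepts_append(ACKS, 2, MIN_INSYNC_REPLICAS) is True. The sketch covers only the check made when a message is appended; it says nothing about requests that are already waiting in the purgatory, which is the case this issue is about.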
[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change
[ https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047753#comment-15047753 ]

Xing Huang commented on KAFKA-2960:
-----------------------------------

The Partition class checks the log end offset (LEO) and the current ISR to decide whether there are enough replicas. But after a leader becomes a follower, it may truncate its log. If it becomes leader again very quickly, another client may send messages to it, so the LEO increases again. Once the current ISR has changed to 2, the DelayedProduce is satisfied, even with acks=-1, min.isr=2 and replica.factor=3.
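The kind of check described here can be sketched in Python as below. This is a simplified model written for illustration, not the actual Partition code: enough_replicas_reached() and its parameters are assumed names, and the comparison against the high watermark is an assumption about how replication progress is measured.

```python
def enough_replicas_reached(required_offset, high_watermark, isr_size, min_isr):
    """Return (satisfied, error) for a pending produce request.

    Simplified model: the request is satisfied once replication has
    advanced to the offset the request is waiting for.
    """
    if high_watermark >= required_offset:
        if isr_size >= min_isr:
            return True, None
        return True, "NOT_ENOUGH_REPLICAS_AFTER_APPEND"
    return False, None
```

The inputs are only offsets and ISR sizes. Nothing in them identifies which message occupied required_offset when the request was created, so a request left over from an earlier leadership term passes the same check as a fresh one once the offset is reached again.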
[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change
[ https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047825#comment-15047825 ]

Jiangjie Qin commented on KAFKA-2960:
-------------------------------------

Not sure I follow the issue. In that case the ProducerRequest to the new leader will sit in the purgatory until the new follower has the message.
[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change
[ https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048206#comment-15048206 ]

Xing Huang commented on KAFKA-2960:
-----------------------------------

We use Kafka 0.8.2. The produce request timeout is the default value, which is 1ms.

At 39:07, the controller did a preferred leader election and sent LeaderAndIsrRequests to the replicas. Just one or two seconds later, it found that the broker hosting the new leader had failed. So the controller did another leader election and sent a second batch of LeaderAndIsrRequests.

At 40:11, the related replicas processed the first LeaderAndIsrRequest. At 40:12, they processed the second LeaderAndIsrRequest. So, I think the original leader went through a leader -> follower -> leader change in just two seconds.
[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change
[ https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048115#comment-15048115 ]

Guozhang Wang commented on KAFKA-2960:
--------------------------------------

[~becket_qin] I think the issue here is that when a broker becomes a follower, its delayed produce request does NOT get cleaned up and return an error code to the producer, but still sits in the purgatory. If the producer timeout is long enough that the request does not time out, it can be incorrectly satisfied when the follower becomes leader again.

For example, let's say we have two brokers: Broker 1 is the leader with its current LEO 50, HW 50. Broker 2 is the follower with current LEO 50, HW 50.

1) broker 1 gets one message "a" with ack = all and appends it with offset 51, so its LEO is 51.
2) this produce request sits in the purgatory, waiting for broker 2 to replicate.
3) broker 1 becomes the follower and broker 2 becomes the leader.
4) broker 1 sees that broker 2's HW is 50, so it truncates message "a" and resets its LEO to 50.
5) broker 1 becomes leader again and broker 2 becomes follower again.
6) broker 1 gets another message "b" and appends it with offset 51.
7) broker 2 replicates message "b".
8) broker 1 now advances its HW to 51, satisfying both produce requests for "a" and "b" based on the offset, although "a" was actually truncated.

[~peoplebike] I'm wondering, in your case, what produce request timeout value triggered this issue? And how long did it take for the original leader to transition to follower and back to leader again?
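The eight steps in Guozhang's example can be replayed with a toy model. The Python sketch below is an illustration under stated assumptions, not Kafka's implementation: Broker, DelayedProduce and run_scenario are hypothetical names, offsets follow the numbering used in the example (LEO 50, next append at offset 51), and the clear_on_follower flag models the approach named in the pull request title, "Clear purgatory for partitions before becoming follower".

```python
from dataclasses import dataclass, field


@dataclass
class DelayedProduce:
    message: str
    required_offset: int  # satisfied once the high watermark reaches this


@dataclass
class Broker:
    """Toy model of one partition replica (hypothetical, not Kafka code)."""
    leo: int = 50                                   # log end offset
    hw: int = 50                                    # high watermark
    log: dict = field(default_factory=dict)         # offset -> message
    purgatory: list = field(default_factory=list)   # pending requests
    clear_on_follower: bool = False                 # True models the fix

    def append_as_leader(self, message):
        self.leo += 1
        self.log[self.leo] = message
        self.purgatory.append(DelayedProduce(message, self.leo))

    def become_follower(self, leader_hw):
        failed = []
        if self.clear_on_follower:
            # Fail every pending request instead of leaving it behind.
            failed = [d.message for d in self.purgatory]
            self.purgatory.clear()
        # Truncate everything above the new leader's high watermark.
        for offset in [o for o in self.log if o > leader_hw]:
            del self.log[offset]
        self.leo = leader_hw
        return failed

    def advance_hw(self, new_hw):
        self.hw = new_hw
        done = [d for d in self.purgatory if d.required_offset <= self.hw]
        self.purgatory = [d for d in self.purgatory if d.required_offset > self.hw]
        return [d.message for d in done]


def run_scenario(clear_on_follower):
    broker1 = Broker(clear_on_follower=clear_on_follower)
    broker1.append_as_leader("a")          # steps 1-2: "a" at offset 51, waiting
    failed = broker1.become_follower(50)   # steps 3-4: truncate back to HW 50
    broker1.append_as_leader("b")          # steps 5-6: "b" reuses offset 51
    acked = broker1.advance_hw(51)         # steps 7-8: follower has replicated "b"
    return acked, failed, broker1.log
```

In this model, run_scenario(False) acknowledges both "a" and "b" while the log holds only "b" at offset 51, which is the silent loss described above. run_scenario(True) reports "a" as failed at the leadership change and acknowledges only "b".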