[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change

2016-03-06 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15182552#comment-15182552
 ] 

ASF GitHub Bot commented on KAFKA-2960:
---

GitHub user becketqin opened a pull request:

https://github.com/apache/kafka/pull/1018

KAFKA-2960: Clear purgatory for partitions before becoming follower



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/becketqin/kafka KAFKA-2960

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/1018.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1018


commit 6ee590bc8f65217227c8bda98644dce35ed0d701
Author: Jiangjie Qin 
Date:   2016-03-07T04:04:45Z

KAFKA-2960: Clear purgatory for partition before becoming follower




> DelayedProduce may cause message lose during repeatly leader change
> ---
>
> Key: KAFKA-2960
> URL: https://issues.apache.org/jira/browse/KAFKA-2960
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.9.0.0
>Reporter: Xing Huang
>Assignee: Jiangjie Qin
> Fix For: 0.10.0.0
>
>
> related to #KAFKA-1148
> When a leader replica becomes a follower and then the leader again, it may 
> truncate its log while it is a follower. The second time it becomes leader, 
> its ISR may shrink, and if new messages are appended at that moment, the 
> DelayedProduce created when it was leader the first time may be satisfied, 
> and the client will receive a response with no error, even though the 
> messages were actually lost. 
> We simulated this scenario, which confirmed that the message loss can happen. 
> It also appears to be the cause of a data loss we recently experienced, 
> according to broker logs and client logs.
> I think we should check the leader epoch when sending a response, or satisfy 
> DelayedProduce on leader change as described in #KAFKA-1148.
> We may also need a new error code to inform the producer about this error. 
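
The leader-epoch check suggested in the description can be illustrated with a
minimal sketch. This is not Kafka's code: the DelayedProduce class, its fields,
the try_complete function, and the "error"/"ok" results are all hypothetical
names chosen for illustration. The idea is only that a delayed request records
the epoch in which it was appended and is failed if the epoch has changed.

```python
from dataclasses import dataclass


@dataclass
class DelayedProduce:
    # Hypothetical model of a produce request waiting in the purgatory.
    required_offset: int  # offset the high watermark must reach
    leader_epoch: int     # leader epoch in which the message was appended


def try_complete(op, current_epoch, high_watermark):
    """Return "error", "ok", or None if the request must keep waiting."""
    if op.leader_epoch != current_epoch:
        # Leadership changed since the append, so the offset may now
        # refer to a different message. Fail instead of acknowledging.
        return "error"
    if high_watermark >= op.required_offset:
        return "ok"
    return None


# A request created in epoch 1 is rejected once the broker is in epoch 3,
# even though the high watermark has reached its offset.
stale = DelayedProduce(required_offset=51, leader_epoch=1)
print(try_complete(stale, current_epoch=3, high_watermark=51))
```

With an offset-only check the stale request would have been acknowledged; the
epoch comparison is what turns it into an error response.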



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change

2015-12-09 Thread Jiangjie Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049028#comment-15049028
 ] 

Jiangjie Qin commented on KAFKA-2960:
-

[~guozhang] Got it. Thanks for the explanation.



[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change

2015-12-08 Thread Jiangjie Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047198#comment-15047198
 ] 

Jiangjie Qin commented on KAFKA-2960:
-

[~iBuddha] In Kafka, persistence guarantees are offered at different levels. 
Would the following settings solve the scenario you mentioned?
acks=-1
min.isr=2
replication factor > 2

This should guarantee that when the response is sent, at least two brokers in 
the ISR have persisted the messages. So there should be no message loss unless 
the entire cluster is down.



[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change

2015-12-08 Thread Xing Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047753#comment-15047753
 ] 

Xing Huang commented on KAFKA-2960:
---

The Partition class checks the log end offset and the current ISR to decide 
whether there are enough replicas. But after a leader becomes a follower, it 
may truncate its log, and if it becomes leader again very quickly, there is a 
chance that another client sends messages to it, so the LEO increases. If the 
current ISR has changed to 2 by then, the DelayedProduce is satisfied, even 
with acks=-1, min.isr=2, and replication factor=3.
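
A simplified model of that check shows why the settings alone do not help. This
is only a sketch under assumptions, not the actual Partition code: the function
enough_replicas and its parameters are hypothetical, and the check is reduced
to "every replica in the ISR has reached the required offset and the ISR is at
least min.isr in size".

```python
def enough_replicas(required_offset, isr_leos, min_isr):
    """isr_leos maps broker id -> log end offset for replicas in the ISR."""
    all_caught_up = all(leo >= required_offset for leo in isr_leos.values())
    return len(isr_leos) >= min_isr and all_caught_up


# First term as leader: the request waits at offset 51 and the two
# followers have not replicated it yet, so it stays in the purgatory.
print(enough_replicas(51, {1: 51, 2: 50, 3: 50}, min_isr=2))

# After leader -> follower -> leader: the log was truncated, another
# client's message took the LEO back to 51, and the ISR shrank to two
# caught-up replicas. The old request now passes the same check.
print(enough_replicas(51, {1: 51, 2: 51}, min_isr=2))
```

The check compares offsets only, so it cannot tell that offset 51 now holds a
different message than the one the first request was waiting for.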



[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change

2015-12-08 Thread Jiangjie Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047825#comment-15047825
 ] 

Jiangjie Qin commented on KAFKA-2960:
-

I'm not sure I follow the issue. In that case, the ProducerRequest sent to the 
new leader will sit in the purgatory until the new follower has the message.



[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change

2015-12-08 Thread Xing Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048206#comment-15048206
 ] 

Xing Huang commented on KAFKA-2960:
---

We use Kafka 0.8.2. The produce request timeout is the default value, which is 
1ms.
At 39:07, the controller did a preferred leader election and sent 
LeaderAndIsrRequests to the replicas. Just one or two seconds later, it found 
that the broker hosting the new leader had failed. So the controller did 
another leader election and sent the second batch of LeaderAndIsrRequests.
At 40:11, the related replicas processed the first LeaderAndIsrRequest.
At 40:12, they processed the second LeaderAndIsrRequest.
So I think the original leader went through a leader -> follower -> leader 
change in just two seconds. 



[jira] [Commented] (KAFKA-2960) DelayedProduce may cause message lose during repeatly leader change

2015-12-08 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15048115#comment-15048115
 ] 

Guozhang Wang commented on KAFKA-2960:
--

[~becket_qin] I think the issue here is that when a broker becomes a follower, 
its delayed produce requests are NOT cleaned up and answered with an error 
code to the producer, but remain in the purgatory. If the produce timeout is 
long enough that a request does not expire, it can be incorrectly satisfied 
when the follower becomes leader again. For example, let's say we have two 
brokers:

Broker 1 is the leader with its current LEO 50, HW 50.
Broker 2 is follower with current LEO 50, HW 50.

1) broker 1 gets one message "a" with ack = all and appends it with offset 51, 
so its LEO is 51.
2) this produce request sits in the purgatory, waiting for broker 2 to replicate.
3) broker 1 becomes the follower and broker 2 becomes the leader.
4) broker 1 sees that broker 2's HW is 50, so it truncates message "a" and 
resets its LEO to 50.
5) broker 1 becomes leader again and broker 2 becomes follower again.
6) broker 1 gets another message "b" and appends it with offset 51.
7) broker 2 replicates message "b".
8) broker 1 now advances its HW to 51 and satisfies both produce requests, for 
"a" and "b", based on the offset, but "a" was actually truncated.

[~peoplebike] I'm wondering, in your case, what produce request timeout value 
triggered this issue? And how long did it take for the original leader to 
transition to follower and back to leader again?
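
The eight steps of the two-broker example above can be replayed with a small
toy model. This is a sketch under assumptions, not broker code: the Broker
class, its methods, and the "OK" and "NOT_LEADER" status strings are
hypothetical, and it follows the example's convention that a message appended
at LEO 50 gets offset 51. The clear_purgatory_on_demotion flag models the idea
of failing pending requests when the broker becomes a follower.

```python
class Broker:
    """Toy model of broker 1 from the example (hypothetical names)."""

    def __init__(self, leo, clear_purgatory_on_demotion):
        self.leo = leo
        self.hw = leo
        self.log = {}         # offset -> message
        self.purgatory = []   # (message, required_offset) pairs
        self.responses = []   # (message, status) sent to producers
        self.clear = clear_purgatory_on_demotion

    def append(self, message):
        # Leader appends and parks the request until it is replicated.
        self.leo += 1
        self.log[self.leo] = message
        self.purgatory.append((message, self.leo))

    def become_follower(self, new_leader_hw):
        if self.clear:
            # Fail every pending request instead of leaving it behind.
            for message, _ in self.purgatory:
                self.responses.append((message, "NOT_LEADER"))
            self.purgatory = []
        # Truncate everything above the new leader's high watermark.
        for offset in [o for o in self.log if o > new_leader_hw]:
            del self.log[offset]
        self.leo = new_leader_hw

    def follower_replicated(self, follower_leo):
        # Advance the HW and complete requests purely by offset.
        self.hw = min(self.leo, follower_leo)
        waiting = []
        for message, required in self.purgatory:
            if self.hw >= required:
                self.responses.append((message, "OK"))
            else:
                waiting.append((message, required))
        self.purgatory = waiting


def run(clear):
    b1 = Broker(leo=50, clear_purgatory_on_demotion=clear)
    b1.append("a")              # steps 1-2
    b1.become_follower(50)      # steps 3-4
    b1.append("b")              # steps 5-6, broker 1 is leader again
    b1.follower_replicated(51)  # steps 7-8
    return b1.responses, b1.log.get(51)


print(run(clear=False))  # "a" is acknowledged although offset 51 holds "b"
print(run(clear=True))   # "a" gets an error, only "b" is acknowledged
```

In the first run the producer of "a" receives a success response for a message
that no longer exists in the log. In the second run it receives an error and
can retry.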

