[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034892#comment-15034892
 ] 

Rajini Sivaram commented on KAFKA-2891:
---

[~benstopford] The logs from my failing test runs all show the same pattern: 
the ISR shrinks to 1 and messages are acked while the leader is the only ISR 
member. When the test then kills the leader, those messages are lost, as you 
would expect. The test was intended to run with min.insync.replicas set to 2, 
but due to a bug in the way min.insync.replicas was being set for topics, it 
was left at the default of 1. All tests that currently set 
min.insync.replicas have copied the same config, so the setting is never 
applied. I have updated the PR for KAFKA-2642 with a fix for the 
min.insync.replicas setting in all the tests that set it. I have scheduled a 
build with the fix and will check the results in the morning.
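
To illustrate why the default setting loses messages here, the following is a minimal sketch of the ack decision. It is a toy model, not Kafka code, and it assumes the producer requests acknowledgement from all in-sync replicas. The function name write_outcome and its return strings are hypothetical.

```python
def write_outcome(isr_size, min_insync_replicas=1):
    """Toy model of a broker's decision for a produce request that
    requires acknowledgement from all in-sync replicas.

    Hypothetical helper for illustration only; not part of Kafka.
    """
    if isr_size < 1:
        raise ValueError("this model assumes the leader is in the ISR")
    if isr_size < min_insync_replicas:
        # The broker refuses the write, so the producer never sees an ack.
        return "rejected"
    if isr_size == 1:
        # Acked, but only the leader holds the data: killing it loses the message.
        return "acked, at risk"
    return "acked, replicated"
```

With the default of 1, a write to a leader-only ISR is acked and then lost if the leader is killed. With the intended value of 2, the same write would be rejected instead, so the test would never count it as acked.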

> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
> Issue Type: Bug
> Components: consumer
> Affects Versions: 0.9.0.0
> Reporter: Rajini Sivaram
> Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of missing messages is quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034222#comment-15034222
 ] 

Ben Stopford commented on KAFKA-2891:
-

[~rsivaram] I found an error in my analysis of KAFKA-2909, which means that 
JIRA does refer to actual data loss. KAFKA-2908 remains a client-side issue. 
This puts more evidence behind your theory that nodes are being killed before 
data is replicated. I'll be interested to see if this change is stable on EC2.



[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033457#comment-15033457
 ] 

Rajini Sivaram commented on KAFKA-2891:
---

[~geoffra] Replication tests expect all acked messages to be received even 
though they run with the default min.insync.replicas=1. The test kills the 
leader of a partition in a loop while messages are being produced and consumed. 
This can (and does) result in the ISR dropping to 1 (only the leader is in the 
ISR list). Messages published when there are no other in-sync replicas are 
lost if the leader (the only ISR member) is killed. It seems to me that the 
test's expectations are too high. When I modify the test (hard_bounce with 
SSL/SASL) to wait until there are at least two entries in the ISR list before 
killing the leader, it passes reliably in my local test runs. I wonder if the 
only reason this test has been working is that PLAINTEXT consumers keep up 
with the producer and hence are unlikely to lose messages. Would it be a 
reasonable change to the test to ensure that there are at least two ISRs 
before killing the leader?
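
The proposed wait could look roughly like the sketch below. It is not the actual test change: get_isr is a hypothetical zero-argument callable that the test harness would have to supply, and the timeout and poll interval are assumed values.

```python
import time


def wait_for_isr_size(get_isr, min_size=2, timeout_sec=30.0, poll_sec=0.5):
    """Poll until the ISR reported by get_isr() has at least min_size entries.

    get_isr is a hypothetical callable supplied by the test harness; it
    returns the current ISR as a list of broker ids.
    Returns True once the ISR is large enough, False on timeout.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        if len(get_isr()) >= min_size:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_sec)
```

The test would call this before each kill and fail (or skip the kill) if it returns False, so the leader is only killed once at least one follower is in sync.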



[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033850#comment-15033850
 ] 

Rajini Sivaram commented on KAFKA-2891:
---

[~benstopford] I don't see errors in my local replication test runs when run 
with PLAINTEXT with either the new consumer or the old consumer. But it could 
just be hiding timing issues because the consumer is faster. I will run the 
tests again tonight with the fix from KAFKA-2913. I am hopeful that once the 
issues you are seeing are fixed, the replication tests will just work :-)



[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033560#comment-15033560
 ] 

Ben Stopford commented on KAFKA-2891:
-

[~rsivaram] That sounds reasonable to me. I'm also surprised it works reliably 
with hard bounce currently.

Note also that there are a couple of examples (in subtasks) of intermittent 
failures which look consumer-related (as the data makes it to Kafka). Jason 
kindly took a look at these yesterday, with one related fix: 
[KAFKA-2913|https://issues.apache.org/jira/browse/KAFKA-2913]. 






[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033694#comment-15033694
 ] 

Rajini Sivaram commented on KAFKA-2891:
---

[~benstopford] Yes, you are right, the replication test does set 
min.insync.replicas; please ignore my previous comment.



[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033724#comment-15033724
 ] 

Ben Stopford commented on KAFKA-2891:
-

[~rsivaram] In my investigations, even with min.insync.replicas = 2 and 
clean_shutdown, additional pauses are needed between bounces to get long-term 
stability on EC2. My theory is that this is a consumer-side problem, because 
I don't see evidence of data loss in Kafka. Maybe by waiting for the ISR to 
reach 2 you are getting similar behaviour. Your test is a little more 
extreme, though, due to the hard_bounce.



[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-11-28 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15030455#comment-15030455
 ] 

Ben Stopford commented on KAFKA-2891:
-

Sorry [~hachikuji], typo: it should have said "implies the problem should not 
be consumer side". Now changed. 



[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-11-27 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15030082#comment-15030082
 ] 

Ben Stopford commented on KAFKA-2891:
-

One more bit of info - when this problem occurs the missing messages are not in 
the server data files. This implies the problem should be on the consumer side. 
However we don't seem to see this when the old consumer is used. 



[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-11-27 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15030337#comment-15030337
 ] 

Jason Gustafson commented on KAFKA-2891:


[~benstopford] To be clear, are you saying that the message gap is on the 
server side? In other words, the messages were successfully acked by the 
producer, but were then lost? 



[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-11-26 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15028771#comment-15028771
 ] 

Rajini Sivaram commented on KAFKA-2891:
---

[~benstopford] Thank you, it looks like the same problem as KAFKA-2827 in my 
test logs too. Will rerun the tests when that is fixed.



[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-11-25 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15027301#comment-15027301
 ] 

Ben Stopford commented on KAFKA-2891:
-

So I'm starting to think the problem may be related to 
https://issues.apache.org/jira/browse/KAFKA-2827 (in my case at least). There 
are periods where the ISR drops to 1, which it shouldn't do during a clean 
bounce. Adding artificial pauses between node restarts also appears to remove 
the problem. Not definitive yet. Just a heads up.  
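
For reference, a bounce loop with an artificial pause between restarts might be sketched as follows. This is an illustration, not the test's real code: stop and start are hypothetical callables provided by the harness, and the pause length is an assumed parameter.

```python
import time


def bounce_brokers(brokers, stop, start, pause_sec=0.0, sleep=time.sleep):
    """Restart each broker in turn, optionally pausing between restarts.

    stop and start are hypothetical callables provided by the test harness;
    each takes a single broker identifier.
    """
    for index, broker in enumerate(brokers):
        stop(broker)
        start(broker)
        # Pause between restarts (not after the last one) to give the
        # restarted broker time to catch up and rejoin the ISR.
        if pause_sec > 0 and index < len(brokers) - 1:
            sleep(pause_sec)
```

The sleep argument is injectable only so the loop can be exercised without real delays.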



[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-11-25 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026907#comment-15026907
 ] 

Ben Stopford commented on KAFKA-2891:
-

Yes. I get exactly the same. Worked fine for about six runs then got a run with:

At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: set([29073, 29067, 29076, 29070, 29079])

Which I have not seen before (i.e. just a few messages missing). 
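
The check behind that error is essentially a set difference. A minimal sketch follows, assuming message ids are integers; the helper name missing_acked is hypothetical, and the surrounding id range in the usage example is invented for illustration (only the five missing ids come from the output above).

```python
def missing_acked(acked, consumed):
    """Return the sorted list of acked message ids that were never consumed."""
    return sorted(set(acked) - set(consumed))


# Usage example: the id range 29065..29080 is made up; the five gaps are
# the ids reported in acked_minus_consumed above.
gaps = {29067, 29070, 29073, 29076, 29079}
acked = list(range(29065, 29081))
consumed = [m for m in acked if m not in gaps]
print(missing_acked(acked, consumed))
# prints [29067, 29070, 29073, 29076, 29079]
```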

