[jira] [Updated] (KAFKA-764) Race Condition in Broker Registration after ZooKeeper disconnect

2017-06-13 Thread Robert P. Thille (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert P. Thille updated KAFKA-764:
---
Attachment: BPPF_2900-Broker_Logs.tbz2

I believe the issues started somewhere around the time of these log messages:
{noformat}
[2017-05-25 07:08:25,528] INFO [Controller 2]: Broker 2 resigned as the controller (kafka.controller.KafkaController)
[2017-05-25 07:09:02,522] INFO [Controller 2]: Broker 2 resigned as the controller (kafka.controller.KafkaController)
{noformat}


> Race Condition in Broker Registration after ZooKeeper disconnect
> 
>
> Key: KAFKA-764
> URL: https://issues.apache.org/jira/browse/KAFKA-764
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.7.1
> Reporter: Bob Cotton
> Attachments: BPPF_2900-Broker_Logs.tbz2
>
>
> When running our ZooKeepers in VMware, occasionally all the keepers 
> simultaneously pause long enough for the Kafka clients to time out, and then 
> the keepers simultaneously un-pause.
> When this happens, the zk clients disconnect from ZooKeeper. When ZooKeeper 
> comes back, ZkUtils.createEphemeralPathExpectConflict finds the broker's own 
> stale node id, does not re-register the broker id node, and the call still 
> succeeds. ZooKeeper then notices that the broker had disconnected and deletes 
> the ephemeral node, but only *after* allowing consumers to read the data 
> in the /brokers/ids/x node. The broker then goes on to register all the 
> topics, etc. When consumers connect, they see topic nodes associated with 
> the broker but they can't find the broker node to get connection information 
> for the broker, sending them into a rebalance loop until they reach 
> rebalance.retries.max and fail.
> This might also be a ZooKeeper issue, but the desired behavior for the 
> disconnect case might be: if the broker node is found to already exist, 
> explicitly delete and recreate it.
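A minimal sketch of the delete-and-recreate idea suggested above, assuming the I0Itec ZkClient API that Kafka's ZkUtils wraps; the helper name and data handling are illustrative, not Kafka's actual registration code:
{noformat}
import org.I0Itec.zkclient.ZkClient

object BrokerReRegistration {
  // Hypothetical helper: instead of treating an existing /brokers/ids/<id>
  // node as "already registered", assume it is a stale ephemeral node from
  // the previous (expired) session, delete it, and recreate it under the
  // current session.
  def reRegisterBrokerId(zkClient: ZkClient, brokerId: Int, brokerInfo: String): Unit = {
    val path = "/brokers/ids/" + brokerId
    if (zkClient.exists(path)) {
      // Stale node left over from the old session; remove it so the new
      // session owns the ephemeral node.
      zkClient.delete(path)
    }
    // Assumes the ZkClient was built with a string serializer for node data.
    zkClient.createEphemeral(path, brokerInfo)
  }
}
{noformat}
Even with this approach there is a small window between the delete and the create where the delayed expiry of the old node could still interfere, so a real fix would likely also need to handle a node-exists error on the create.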





[jira] [Commented] (KAFKA-764) Race Condition in Broker Registration after ZooKeeper disconnect

2017-06-13 Thread Robert P. Thille (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048552#comment-16048552
 ] 

Robert P. Thille commented on KAFKA-764:


I believe we saw this issue, or something very similar.
During a load test, our 3-node Kafka cluster got into a confused state: 
brokers 0 and 1 were happy and listed under /brokers/ids/X in ZK, while broker 2 
was connected to ZK but not listed under /brokers/ids/2, and brokers 0 & 1 had 
no connections to broker 2.
Broker 2 happily accepted newly produced messages for hours.
Eventually it did rejoin the cluster, but the messages published to it were lost, 
as brokers 0 & 1 seemingly outvoted broker 2 about the state of the partitions.
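For what it's worth, a small sketch of the kind of consistency check that exposes this state, again assuming the I0Itec ZkClient API; the expected broker ids come from the caller and the object/method names are made up for illustration:
{noformat}
import org.I0Itec.zkclient.ZkClient
import scala.collection.JavaConverters._

object BrokerRegistrationCheck {
  // Compare the broker ids we expect to be alive against the ephemeral
  // registrations actually present under /brokers/ids in ZooKeeper.
  def missingRegistrations(zkClient: ZkClient, expectedIds: Set[Int]): Set[Int] = {
    val registered = zkClient.getChildren("/brokers/ids").asScala.map(_.toInt).toSet
    expectedIds.diff(registered)
  }
}
{noformat}
In the state described above this would report broker 2 as missing even though broker 2 itself still held a live ZK connection.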

> Race Condition in Broker Registration after ZooKeeper disconnect
> 
>
> Key: KAFKA-764
> URL: https://issues.apache.org/jira/browse/KAFKA-764
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.7.1
> Reporter: Bob Cotton





[jira] [Comment Edited] (KAFKA-877) Still getting kafka.common.NotLeaderForPartitionException

2017-05-18 Thread Robert P. Thille (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016655#comment-16016655
 ] 

Robert P. Thille edited comment on KAFKA-877 at 5/18/17 11:32 PM:
--

I have logs from a 3-node cluster which got into a bad state after ZK expired:
{noformat}
[2017-05-09 18:33:42,897] INFO [SessionExpirationListener on 2], ZK expired; 
shut down all controller components and try to re-elect 
(kafka.controller.KafkaController$SessionExpirationListener)
[2017-05-17 05:06:13,469] INFO [SessionExpirationListener on 2], ZK expired; 
shut down all controller components and try to re-elect 
(kafka.controller.KafkaController$SessionExpirationListener)
[2017-05-17 16:33:43,349] INFO [SessionExpirationListener on 2], ZK expired; 
shut down all controller components and try to re-elect 
(kafka.controller.KafkaController$SessionExpirationListener)
[2017-05-17 16:33:44,059] INFO [Controller 2]: Broker 2 starting become 
controller state transition (kafka.controller.KafkaController)
{noformat}
The first two ZK session expirations were handled fine; the third broke everything.
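For context, a rough sketch of the shape of the session-expiration handling that the log lines above refer to, assuming the org.I0Itec.zkclient IZkStateListener interface; this is illustrative, not Kafka's actual KafkaController code:
{noformat}
import org.I0Itec.zkclient.IZkStateListener
import org.apache.zookeeper.Watcher.Event.KeeperState

// zkclient establishes a new ZK session after an expiration and then calls
// handleNewSession(); that is where controller state has to be rebuilt and
// a re-election attempted, because every ephemeral node owned by the old
// session is gone at that point.
class SessionExpirationHandler(onNewSession: () => Unit) extends IZkStateListener {
  def handleStateChanged(state: KeeperState): Unit = {
    // Plain connection-level changes (Disconnected/SyncConnected) need no action here.
  }
  def handleNewSession(): Unit = {
    // Tear down controller components and try to re-elect.
    onNewSession()
  }
  def handleSessionEstablishmentError(error: Throwable): Unit = {
    // Only invoked by newer zkclient versions if the new session cannot be established.
  }
}
{noformat}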


was (Author: rthille):
I have logs from a 3-node cluster which got into a bad state after ZK expired:
{noformat}
[2017-05-09 18:33:42,897] INFO [SessionExpirationListener on 2], ZK expired; 
shut down all controller components and try to re-elect 
(kafka.controller.KafkaController$SessionExpirationListener)
[2017-05-17 05:06:13,469] INFO [SessionExpirationListener on 2], ZK expired; 
shut down all controller components and try to re-elect 
(kafka.controller.KafkaController$SessionExpirationListener)
[2017-05-17 16:33:43,349] INFO [SessionExpirationListener on 2], ZK expired; 
shut down all controller components and try to re-elect 
(kafka.controller.KafkaController$SessionExpirationListener)
[2017-05-17 16:33:44,059] INFO [Controller 2]: Broker 2 starting become 
controller state transition (kafka.controller.KafkaController)
{noformat}

> Still getting kafka.common.NotLeaderForPartitionException
> -
>
> Key: KAFKA-877
> URL: https://issues.apache.org/jira/browse/KAFKA-877
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.0
> Environment: DEV
> Reporter: BalajiSeshadri
> Priority: Blocker
> Attachments: KAFKA-816.jpg
>
>
> Using the trunk below, I still see the error happening. Please let us know if 
> this can be fixed.
> https://github.com/apache/kafka.git
> [2013-04-25 16:47:08,924] WARN 
> [console-consumer-24019_MERD7-21964-1366930009136-8b7f9eb7-leader-finder-thread],
>  Failed to add fetcher for [mytopic,0] to broker 
> id:0,host:MERD7-21964.echostar.com,port:9092 
> (kafka.consumer.ConsumerFetcherManager$$anon$1)
> kafka.common.NotLeaderForPartitionException
> at sun.reflect.GeneratedConstructorAccessor1.newInstance(Unknown 
> Source)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
> at java.lang.Class.newInstance0(Class.java:372)
> at java.lang.Class.newInstance(Class.java:325)
> at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:72)
> at 
> kafka.consumer.SimpleConsumer.earliestOrLatestOffset(SimpleConsumer.scala:163)
> at 
> kafka.consumer.ConsumerFetcherThread.handleOffsetOutOfRange(ConsumerFetcherThread.scala:61)
> at 
> kafka.server.AbstractFetcherThread.addPartition(AbstractFetcherThread.scala:167)
> at 
> kafka.server.AbstractFetcherManager.addFetcher(AbstractFetcherManager.scala:48)
> at 
> kafka.consumer.ConsumerFetcherManager$$anon$1$$anonfun$doWork$3.apply(ConsumerFetcherManager.scala:79)
> at 
> kafka.consumer.ConsumerFetcherManager$$anon$1$$anonfun$doWork$3.apply(ConsumerFetcherManager.scala:75)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:95)
> at scala.collection.Iterator$class.foreach(Iterator.scala:772)
> at 
> scala.collection.mutable.HashTable$$anon$1.foreach(HashTable.scala:157)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:190)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:45)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:95)
> at 
> kafka.consumer.ConsumerFetcherManager$$anon$1.doWork(ConsumerFetcherManager.scala:75)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:51)
> We are evaluating Kafka for our new messaging system and we had a tough time 
> running it on Windows.
> We somehow managed to run 0.8 using Cygwin, but when we run the console 
> producer/consumer, we are not getting messages from the consumer.
> Please help us fix this issue; this might not be related, but it keeps 
> throwing this error on the consumer side.
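The stack trace above boils down to the consumer asking a broker that is not (or is no longer) the leader for the partition. As a generic sketch of the retry pattern involved (not the actual consumer code, and with the metadata refresh left as a caller-supplied placeholder since the exact API differs between versions):
{noformat}
import kafka.common.NotLeaderForPartitionException

object LeaderRetry {
  // Illustrative retry loop: when a broker answers with
  // NotLeaderForPartitionException, refresh leader metadata and retry,
  // giving up after maxRetries attempts rather than looping forever.
  def withLeaderRetry[T](maxRetries: Int)(refreshLeader: () => Unit)(fetch: () => T): T = {
    var attempts = 0
    while (true) {
      try {
        return fetch()
      } catch {
        case e: NotLeaderForPartitionException =>
          attempts += 1
          if (attempts > maxRetries) throw e
          refreshLeader() // re-read topic metadata to locate the new leader
      }
    }
    throw new IllegalStateException("unreachable")
  }
}
{noformat}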

[jira] [Commented] (KAFKA-877) Still getting kafka.common.NotLeaderForPartitionException

2017-05-18 Thread Robert P. Thille (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016655#comment-16016655
 ] 

Robert P. Thille commented on KAFKA-877:


I have logs from a 3-node cluster which got into a bad state after ZK expired:
{noformat}
[2017-05-09 18:33:42,897] INFO [SessionExpirationListener on 2], ZK expired; 
shut down all controller components and try to re-elect 
(kafka.controller.KafkaController$SessionExpirationListener)
[2017-05-17 05:06:13,469] INFO [SessionExpirationListener on 2], ZK expired; 
shut down all controller components and try to re-elect 
(kafka.controller.KafkaController$SessionExpirationListener)
[2017-05-17 16:33:43,349] INFO [SessionExpirationListener on 2], ZK expired; 
shut down all controller components and try to re-elect 
(kafka.controller.KafkaController$SessionExpirationListener)
[2017-05-17 16:33:44,059] INFO [Controller 2]: Broker 2 starting become 
controller state transition (kafka.controller.KafkaController)
{noformat}

> Still getting kafka.common.NotLeaderForPartitionException
> -
>
> Key: KAFKA-877
> URL: https://issues.apache.org/jira/browse/KAFKA-877
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.0
> Environment: DEV
> Reporter: BalajiSeshadri
> Priority: Blocker
> Attachments: KAFKA-816.jpg





[jira] [Commented] (KAFKA-877) Still getting kafka.common.NotLeaderForPartitionException

2017-05-18 Thread Robert P. Thille (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16016619#comment-16016619
 ] 

Robert P. Thille commented on KAFKA-877:


We see this in Kafka 0.8.2.1 when our systems get overloaded and either 
ZooKeeper gets stalled trying to fsync, or Kafka gets starved and loses its 
connection to ZK.  It takes a restart of the brokers to get them properly in 
sync thereafter.

> Still getting kafka.common.NotLeaderForPartitionException
> -
>
> Key: KAFKA-877
> URL: https://issues.apache.org/jira/browse/KAFKA-877
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.0
> Environment: DEV
> Reporter: BalajiSeshadri
> Priority: Blocker
> Attachments: KAFKA-816.jpg


