Build failed in Jenkins: Kafka-trunk #235
See https://builds.apache.org/job/Kafka-trunk/235/changes

Changes:

[junrao] kafka-1414; Speedup broker startup after hard reset; patched by Anton Karamanov; reviewed by Jay Kreps and Jun Rao

--
[...truncated 475 lines...]
org.apache.kafka.common.record.RecordTest testChecksum[62] PASSED
org.apache.kafka.common.record.RecordTest testEquality[62] PASSED
org.apache.kafka.common.record.RecordTest testFields[63] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[63] PASSED
org.apache.kafka.common.record.RecordTest testEquality[63] PASSED
org.apache.kafka.common.record.RecordTest testFields[64] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[64] PASSED
org.apache.kafka.common.record.RecordTest testEquality[64] PASSED
org.apache.kafka.common.record.RecordTest testFields[65] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[65] PASSED
org.apache.kafka.common.record.RecordTest testEquality[65] PASSED
org.apache.kafka.common.record.RecordTest testFields[66] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[66] PASSED
org.apache.kafka.common.record.RecordTest testEquality[66] PASSED
org.apache.kafka.common.record.RecordTest testFields[67] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[67] PASSED
org.apache.kafka.common.record.RecordTest testEquality[67] PASSED
org.apache.kafka.common.record.RecordTest testFields[68] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[68] PASSED
org.apache.kafka.common.record.RecordTest testEquality[68] PASSED
org.apache.kafka.common.record.RecordTest testFields[69] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[69] PASSED
org.apache.kafka.common.record.RecordTest testEquality[69] PASSED
org.apache.kafka.common.record.RecordTest testFields[70] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[70] PASSED
org.apache.kafka.common.record.RecordTest testEquality[70] PASSED
org.apache.kafka.common.record.RecordTest testFields[71] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[71] PASSED
org.apache.kafka.common.record.RecordTest testEquality[71] PASSED
org.apache.kafka.common.record.RecordTest testFields[72] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[72] PASSED
org.apache.kafka.common.record.RecordTest testEquality[72] PASSED
org.apache.kafka.common.record.RecordTest testFields[73] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[73] PASSED
org.apache.kafka.common.record.RecordTest testEquality[73] PASSED
org.apache.kafka.common.record.RecordTest testFields[74] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[74] PASSED
org.apache.kafka.common.record.RecordTest testEquality[74] PASSED
org.apache.kafka.common.record.RecordTest testFields[75] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[75] PASSED
org.apache.kafka.common.record.RecordTest testEquality[75] PASSED
org.apache.kafka.common.record.RecordTest testFields[76] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[76] PASSED
org.apache.kafka.common.record.RecordTest testEquality[76] PASSED
org.apache.kafka.common.record.RecordTest testFields[77] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[77] PASSED
org.apache.kafka.common.record.RecordTest testEquality[77] PASSED
org.apache.kafka.common.record.RecordTest testFields[78] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[78] PASSED
org.apache.kafka.common.record.RecordTest testEquality[78] PASSED
org.apache.kafka.common.record.RecordTest testFields[79] PASSED
org.apache.kafka.common.record.RecordTest testChecksum[79] PASSED
org.apache.kafka.common.record.RecordTest testEquality[79] PASSED
org.apache.kafka.common.record.MemoryRecordsTest testIterator[0] PASSED
org.apache.kafka.common.record.MemoryRecordsTest testIterator[1] PASSED
org.apache.kafka.common.record.MemoryRecordsTest testIterator[2] PASSED
org.apache.kafka.common.record.MemoryRecordsTest testIterator[3] PASSED
org.apache.kafka.common.record.MemoryRecordsTest testIterator[4] PASSED
org.apache.kafka.clients.NetworkClientTest testReadyAndDisconnect PASSED
org.apache.kafka.clients.NetworkClientTest testSendToUnreadyNode PASSED
org.apache.kafka.clients.NetworkClientTest testSimpleRequestResponse PASSED
org.apache.kafka.clients.producer.MetadataTest testMetadata PASSED
org.apache.kafka.clients.producer.MockProducerTest testAutoCompleteMock PASSED
org.apache.kafka.clients.producer.MockProducerTest testManualCompletion PASSED
org.apache.kafka.clients.producer.RecordAccumulatorTest testFull PASSED
org.apache.kafka.clients.producer.RecordAccumulatorTest testAppendLarge PASSED
org.apache.kafka.clients.producer.RecordAccumulatorTest testLinger PASSED
org.apache.kafka.clients.producer.RecordAccumulatorTest testPartialDrain PASSED
[jira] [Commented] (KAFKA-1490) remove gradlew initial setup output from source distribution
[ https://issues.apache.org/jira/browse/KAFKA-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075969#comment-14075969 ]

Stevo Slavic commented on KAFKA-1490:
-------------------------------------

The SAMZA-283 solution doesn't make sense to me - why would one want to bootstrap the Gradle wrapper if one already has Gradle installed? I've raised the issue on the Gradle forum (see [here|http://forums.gradle.org/gradle/topics/bootstrap_gradle_wrapper_with_gradle_wrapper_scripts]).

remove gradlew initial setup output from source distribution

             Key: KAFKA-1490
             URL: https://issues.apache.org/jira/browse/KAFKA-1490
         Project: Kafka
      Issue Type: Bug
        Reporter: Joe Stein
        Assignee: Ivan Lyutov
        Priority: Blocker
         Fix For: 0.8.2

Our current source release contains lots of stuff in the gradle folder that we do not need.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (KAFKA-1559) Upgrade Gradle wrapper to Gradle 2.0
[ https://issues.apache.org/jira/browse/KAFKA-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075971#comment-14075971 ]

Stevo Slavic commented on KAFKA-1559:
-------------------------------------

Added a link noting that this issue depends on resolving KAFKA-1490.

Upgrade Gradle wrapper to Gradle 2.0

             Key: KAFKA-1559
             URL: https://issues.apache.org/jira/browse/KAFKA-1559
         Project: Kafka
      Issue Type: Task
      Components: build
Affects Versions: 0.8.1.1
        Reporter: Stevo Slavic
        Priority: Trivial
          Labels: build
     Attachments: KAFKA-1559.patch
[jira] [Updated] (KAFKA-1533) transient unit test failure in ProducerFailureHandlingTest
[ https://issues.apache.org/jira/browse/KAFKA-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Corley updated KAFKA-1533:
--------------------------------
    Attachment: kafka.threads

Seeing the same issue. Jun, not sure if you attached your own thread dump or a copy of mine from the mailing list, but attaching here again per your request.

transient unit test failure in ProducerFailureHandlingTest

             Key: KAFKA-1533
             URL: https://issues.apache.org/jira/browse/KAFKA-1533
         Project: Kafka
      Issue Type: Bug
      Components: core
Affects Versions: 0.8.2
        Reporter: Jun Rao
        Assignee: Guozhang Wang
         Fix For: 0.8.2
     Attachments: KAFKA-1533.patch, KAFKA-1533.patch, KAFKA-1533.patch, KAFKA-1533_2014-07-21_15:45:58.patch, kafka.threads, stack.out

Occasionally, the test hangs on tear down. The following is the stack trace:

"Test worker" prio=5 tid=7f9246956000 nid=0x10e078000 in Object.wait() [10e075000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <7f4e69578> (a org.apache.zookeeper.ClientCnxn$Packet)
    at java.lang.Object.wait(Object.java:485)
    at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1344)
    - locked <7f4e69578> (a org.apache.zookeeper.ClientCnxn$Packet)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:732)
    at org.I0Itec.zkclient.ZkConnection.delete(ZkConnection.java:91)
    at org.I0Itec.zkclient.ZkClient$8.call(ZkClient.java:720)
    at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
    at org.I0Itec.zkclient.ZkClient.delete(ZkClient.java:716)
    at kafka.utils.ZkUtils$.deletePath(ZkUtils.scala:416)
    at kafka.utils.ZkUtils$.deregisterBrokerInZk(ZkUtils.scala:184)
    at kafka.server.KafkaHealthcheck.shutdown(KafkaHealthcheck.scala:50)
    at kafka.server.KafkaServer$$anonfun$shutdown$2.apply$mcV$sp(KafkaServer.scala:243)
    at kafka.utils.Utils$.swallow(Utils.scala:172)
    at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
    at kafka.utils.Utils$.swallowWarn(Utils.scala:45)
    at kafka.utils.Logging$class.swallow(Logging.scala:94)
    at kafka.utils.Utils$.swallow(Utils.scala:45)
    at kafka.server.KafkaServer.shutdown(KafkaServer.scala:243)
    at kafka.api.ProducerFailureHandlingTest.tearDown(ProducerFailureHandlingTest.scala:90)
Re: Gradle hanging on unit test run against ProducerFailureHangingTest-testBrokerFailure
Done

On Sun, Jul 27, 2014 at 7:45 PM, Jun Rao jun...@gmail.com wrote:

David,

Apache mailing list doesn't seem to allow large attachments. Could you attach the stack trace to the jira KAFKA-1533 (now reopened)?

Thanks,

Jun

On Sun, Jul 27, 2014 at 11:21 AM, David Corley davidcor...@gmail.com wrote:

Nope. It definitely sent. Are there some restrictions on mailing list attachments, I wonder? I'll put it inline here:

=
2014-07-25 18:40:15
Full thread dump Java HotSpot(TM) 64-Bit Server VM (20.65-b04-462 mixed mode):

"Attach Listener" daemon prio=9 tid=7fcfb92a5000 nid=0x11a961000 waiting on condition []
   java.lang.Thread.State: RUNNABLE

"kafka-scheduler-17" daemon prio=5 tid=7fcfbb80c000 nid=0x1387c3000 waiting on condition [1387c2000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <7f53b84d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:196)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2025)
    at java.util.concurrent.DelayQueue.take(DelayQueue.java:164)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:609)
    at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:602)
    at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:957)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:917)
    at java.lang.Thread.run(Thread.java:695)

"ReplicaFetcherThread-0-0" prio=5 tid=7fcfbb80b000 nid=0x1385bd000 runnable [1385bb000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
    at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:136)
    at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:69)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
    - locked <7f53b7578> (a sun.nio.ch.Util$2)
    - locked <7f53b7560> (a java.util.Collections$UnmodifiableSet)
    - locked <7f5400668> (a sun.nio.ch.KQueueSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
    at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:193)
    - locked <7f53b7590> (a java.lang.Object)
    at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
    - locked <7f5405530> (a sun.nio.ch.SocketAdaptor$SocketInputStream)
    at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:221)
    - locked <7f5407fc0> (a java.lang.Object)
    at kafka.utils.Utils$.read(Utils.scala:380)
    at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
    at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
    at kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
    at kafka.network.BlockingChannel.receive(BlockingChannel.scala:108)
    at kafka.consumer.SimpleConsumer.liftedTree1$1(SimpleConsumer.scala:71)
    at kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(SimpleConsumer.scala:68)
    - locked <7f5407ff0> (a java.lang.Object)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(SimpleConsumer.scala:112)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:112)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(SimpleConsumer.scala:112)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(SimpleConsumer.scala:111)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
    at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(SimpleConsumer.scala:111)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
    at kafka.consumer.SimpleConsumer.fetch(SimpleConsumer.scala:110)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:96)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:88)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)

"Controller-0-to-broker-1-send-thread" prio=5 tid=7fcfbb809800 nid=0x1384ba000 waiting on condition [1384b9000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <7f53eba88> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
    at
Re: Review Request 23962: Patch for KAFKA-1451
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23962/
---

(Updated July 28, 2014, 2:50 p.m.)

Review request for kafka.

Bugs: KAFKA-1451
    https://issues.apache.org/jira/browse/KAFKA-1451

Repository: kafka

Description (updated)
---

Addressing Jun's comments

Diffs (updated)
---

core/src/main/scala/kafka/server/ZookeeperLeaderElector.scala e5b6ff1e2544b043007cf16a6b9dd4451c839e63

Diff: https://reviews.apache.org/r/23962/diff/

Testing
---

Thanks,

Manikumar Reddy O
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076270#comment-14076270 ]

Manikumar Reddy commented on KAFKA-1451:
----------------------------------------

Updated reviewboard https://reviews.apache.org/r/23962/diff/ against branch origin/trunk

Broker stuck due to leader election race

             Key: KAFKA-1451
             URL: https://issues.apache.org/jira/browse/KAFKA-1451
         Project: Kafka
      Issue Type: Bug
      Components: core
Affects Versions: 0.8.1.1
        Reporter: Maciek Makowski
        Assignee: Manikumar Reddy
        Priority: Minor
          Labels: newbie
     Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:17:21.patch

h3. Symptoms

The broker does not become available because it is stuck in an infinite loop while electing a leader. This can be recognised by the following line being repeatedly written to server.log:

{code}
[2014-05-14 04:35:09,187] INFO I wrote this conflicted ephemeral node [{version:1,brokerid:1,timestamp:1400060079108}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
{code}

h3. Steps to Reproduce

In a setup with a single Kafka 0.8.1.1 node and a single ZooKeeper 3.4.6 node (it will likely behave the same with the ZK version included in the Kafka distribution):

# start both zookeeper and kafka (in any order)
# stop zookeeper
# stop kafka
# start kafka
# start zookeeper

h3. Likely Cause

{{ZookeeperLeaderElector}} subscribes to data changes on startup, and then triggers an election. If the deletion of the ephemeral {{/controller}} node associated with the broker's previous ZooKeeper session happens after the subscription to changes in the new session, the election will be invoked twice, once from {{startup}} and once from {{handleDataDeleted}}:

* {{startup}}: acquire {{controllerLock}}
* {{startup}}: subscribe to data changes
* zookeeper: delete {{/controller}} since the session that created it timed out
* {{handleDataDeleted}}: {{/controller}} was deleted
* {{handleDataDeleted}}: wait on {{controllerLock}}
* {{startup}}: elect -- writes {{/controller}}
* {{startup}}: release {{controllerLock}}
* {{handleDataDeleted}}: acquire {{controllerLock}}
* {{handleDataDeleted}}: elect -- attempts to write {{/controller}} and then gets into an infinite loop as a result of the conflict

{{createEphemeralPathExpectConflictHandleZKBug}} assumes that the existing znode was written from a different session, which is not true in this case; it was written from the same session. That adds to the confusion.

h3. Suggested Fix

In {{ZookeeperLeaderElector.startup}}, first run {{elect}} and then subscribe to data changes.
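The suggested fix is purely a reordering: run the initial election before registering the deletion watcher, so a stale-session deletion of {{/controller}} can no longer interleave a second election with startup's own. A minimal sketch of that ordering (illustrative names only, not the actual ZookeeperLeaderElector source):

```java
import java.util.ArrayList;
import java.util.List;

public class LeaderElectorSketch {
    // Records the order in which startup performs its two steps.
    final List<String> events = new ArrayList<>();
    // Watchers that would be fired by a /controller deletion event.
    final List<Runnable> watchers = new ArrayList<>();

    void elect() {
        events.add("elect"); // stands in for writing the /controller znode
    }

    void subscribeToDataChanges() {
        events.add("subscribe");
        watchers.add(this::elect); // handleDataDeleted would re-elect
    }

    // Fixed ordering per the issue: elect first, then subscribe, so a
    // pending stale-session deletion cannot trigger a conflicting second
    // election between the subscription and the initial elect.
    void startup() {
        elect();
        subscribeToDataChanges();
    }
}
```

Calling `startup()` on this sketch records `elect` strictly before `subscribe`, which is the whole content of the proposed fix; the buggy ordering was the reverse.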
[jira] [Updated] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manikumar Reddy updated KAFKA-1451:
-----------------------------------
    Attachment: KAFKA-1451_2014-07-28_20:17:21.patch
[jira] [Updated] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manikumar Reddy updated KAFKA-1451:
-----------------------------------
    Attachment: (was: KAFKA-1451_2014-07-28_20:17:21.patch)
Review Request 23983: Patch for KAFKA-1451
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23983/
---

Review request for kafka.

Bugs: KAFKA-1451
    https://issues.apache.org/jira/browse/KAFKA-1451

Repository: kafka

Description
---

controller existence check added in election process of ZookeeperLeaderElector

Diffs
---

core/src/main/scala/kafka/server/ZookeeperLeaderElector.scala e5b6ff1e2544b043007cf16a6b9dd4451c839e63

Diff: https://reviews.apache.org/r/23983/diff/

Testing
---

Thanks,

Manikumar Reddy O
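The one-line description above proposes a different guard than the reordering fix: check whether a controller already exists before attempting to write {{/controller}}. A hypothetical sketch of such a guard (names invented for illustration; the actual change is in the linked diff):

```java
import java.util.Optional;

public class ElectionGuardSketch {
    // Stand-in for the contents of the /controller znode; empty means
    // no controller is currently registered.
    private Optional<Integer> currentControllerId = Optional.empty();
    public int writes = 0;

    Optional<Integer> readControllerPath() {
        return currentControllerId;
    }

    // Guarded election: only write /controller when no controller is
    // registered, so a re-entrant elect() call (e.g. from a deletion
    // watcher) does not conflict with a node this session just wrote.
    public boolean elect(int brokerId) {
        Optional<Integer> existing = readControllerPath();
        if (existing.isPresent()) {
            // Someone is already controller; we won only if it is us.
            return existing.get() == brokerId;
        }
        currentControllerId = Optional.of(brokerId); // stands in for the ZK write
        writes++;
        return true;
    }
}
```

With this guard, a second `elect()` from the same broker becomes a no-op instead of a conflicting write, which is the failure mode described in the issue.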
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076274#comment-14076274 ]

Manikumar Reddy commented on KAFKA-1451:
----------------------------------------

Created reviewboard https://reviews.apache.org/r/23983/diff/ against branch origin/trunk
[jira] [Updated] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manikumar Reddy updated KAFKA-1451:
-----------------------------------
    Attachment: KAFKA-1451.patch
Re: Review Request 23962: Patch for KAFKA-1451
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23962/
---

(Updated July 28, 2014, 2:59 p.m.)

Review request for kafka.

Bugs: KAFKA-1451
    https://issues.apache.org/jira/browse/KAFKA-1451

Repository: kafka

Description (updated)
---

Addressing Jun's comments

Diffs (updated)
---

core/src/main/scala/kafka/server/ZookeeperLeaderElector.scala e5b6ff1e2544b043007cf16a6b9dd4451c839e63

Diff: https://reviews.apache.org/r/23962/diff/

Testing
---

Thanks,

Manikumar Reddy O
[jira] [Updated] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manikumar Reddy updated KAFKA-1451:
-----------------------------------
    Attachment: (was: KAFKA-1451.patch)
[jira] [Updated] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manikumar Reddy updated KAFKA-1451: --- Attachment: KAFKA-1451_2014-07-28_20:27:32.patch Broker stuck due to leader election race - Key: KAFKA-1451 URL: https://issues.apache.org/jira/browse/KAFKA-1451 Project: Kafka Issue Type: Bug Components: core Affects Versions: 0.8.1.1 Reporter: Maciek Makowski Assignee: Manikumar Reddy Priority: Minor Labels: newbie Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch h3. Symptoms The broker does not become available due to being stuck in an infinite loop while electing leader. This can be recognised by the following line being repeatedly written to server.log: {code} [2014-05-14 04:35:09,187] INFO I wrote this conflicted ephemeral node [{version:1,brokerid:1,timestamp:1400060079108}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$) {code} h3. Steps to Reproduce In a single kafka 0.8.1.1 node, single zookeeper 3.4.6 (but will likely behave the same with the ZK version included in Kafka distribution) node setup: # start both zookeeper and kafka (in any order) # stop zookeeper # stop kafka # start kafka # start zookeeper h3. Likely Cause {{ZookeeperLeaderElector}} subscribes to data changes on startup, and then triggers an election. 
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076277#comment-14076277 ] Manikumar Reddy commented on KAFKA-1451: Updated reviewboard https://reviews.apache.org/r/23962/diff/ against branch origin/trunk
Re: Review Request 23895: Patch for KAFKA-1419
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/23895/ --- (Updated July 28, 2014, 3:07 p.m.) Review request for kafka. Bugs: KAFKA-1419 https://issues.apache.org/jira/browse/KAFKA-1419 Repository: kafka Description (updated) --- KAFKA-1419 - cross build for scala 2.11 - dropped scala 2.8 support - minor bug fixes Diffs (updated) - build.gradle a72905df824ba68bed5d5170d18873c23e1782c9 core/src/main/scala/kafka/javaapi/message/ByteBufferMessageSet.scala fecee8d5f7b32f483bb1bfc6a5080d589906f9c4 core/src/main/scala/kafka/message/ByteBufferMessageSet.scala 73401c5ff34d08abce22267aa9c4d86632c6fb74 gradle.properties 4827769a3f8e34f0fe7e783eb58e44d4db04859b gradle/buildscript.gradle 225e0a82708bc5f390e5e2c1d4d9a0d06f491b95 gradle/wrapper/gradle-wrapper.properties 610282a699afc89a82203ef0e4e71ecc53761039 scala.gradle ebd21b870c0746aade63248344ab65d9b5baf820 Diff: https://reviews.apache.org/r/23895/diff/ Testing --- Thanks, Ivan Lyutov
[jira] [Commented] (KAFKA-1555) provide strong consistency with reasonable availability
[ https://issues.apache.org/jira/browse/KAFKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076286#comment-14076286 ] Jun Rao commented on KAFKA-1555:

Hmm, I have to understand the semantics and the implementation of dfs.replication.min a bit more. For now, this seems more like the ack 1 case in Kafka. Basically, if you have R replicas and you set dfs.replication.min to a value smaller than R, you can get an ack back sooner. However, there is a slight possibility that the acked data is lost with fewer than R-1 failures.

A couple of other thoughts on min.isr.required.

1. Let's say we do support this mode. However, immediately after a message is successfully published, some replicas could go down and bring the ISR below min.isr.required. There is still no guarantee that you have more than one copy of the data at that point. So, min.isr.required can only be enforced at the time of publishing. Is that what you want?

2. I imagine the implementation could be: when a message is committed, we check the ISR size and return an error if |ISR| is less than min.isr.required. However, the message is still committed and will be exposed to the consumer. The error is just to tell the producer that the message may be under-replicated. Does that match what you expect?

provide strong consistency with reasonable availability
Key: KAFKA-1555
URL: https://issues.apache.org/jira/browse/KAFKA-1555
Project: Kafka
Issue Type: Improvement
Components: controller
Affects Versions: 0.8.1.1
Reporter: Jiang Wu
Assignee: Neha Narkhede

In a mission-critical application, we expect that a kafka cluster with 3 brokers can satisfy two requirements: 1. When 1 broker is down, no message loss or service blocking happens. 2. In worse cases, such as when two brokers are down, service can be blocked, but no message loss happens. We found that the current kafka version (0.8.1.1) cannot achieve the requirements due to three of its behaviors: 1.
when choosing a new leader from 2 followers in ISR, the one with fewer messages may be chosen as the leader. 2. even when replica.lag.max.messages=0, a follower can stay in ISR when it has fewer messages than the leader. 3. ISR can contain only 1 broker, therefore acknowledged messages may be stored in only 1 broker.

The following is an analytical proof. We consider a cluster with 3 brokers and a topic with 3 replicas, and assume that at the beginning, all 3 replicas, leader A, followers B and C, are in sync, i.e., they have the same messages and are all in ISR. According to the value of request.required.acks (acks for short), there are the following cases.

1. acks=0, 1, 3. Obviously these settings do not satisfy the requirement.

2. acks=2. Producer sends a message m. It's acknowledged by A and B. At this time, although C hasn't received m, C is still in ISR. If A is killed, C can be elected as the new leader, and consumers will miss m.

3. acks=-1. B and C restart and are removed from ISR. Producer sends a message m to A, and receives an acknowledgement. Disk failure happens in A before B and C replicate m. Message m is lost.

In summary, no existing configuration can satisfy the requirements.

-- This message was sent by Atlassian JIRA (v6.2#6252)
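Point 2 of Jun's comment above describes semantics in which a min.isr.required violation does not block the write: the message is committed regardless, and the broker only reports an error to signal possible under-replication. A minimal sketch of that decision rule (min.isr.required is a proposed setting under discussion, not an existing Kafka config; names here are illustrative):

```java
// Sketch of the semantics described for a proposed min.isr.required setting:
// the message is committed either way; the ack merely carries an error code
// when the ISR has shrunk below the configured minimum at publish time.
class MinIsrSketch {
    static final String NONE = "NONE";
    static final String NOT_ENOUGH_REPLICAS = "NOT_ENOUGH_REPLICAS";

    /** Hypothetical error code returned with an acks=-1 produce response. */
    static String ackError(int isrSize, int minIsrRequired) {
        // At this point the write has already been appended and committed;
        // the error only warns the producer it may be under-replicated.
        return isrSize >= minIsrRequired ? NONE : NOT_ENOUGH_REPLICAS;
    }
}
```

This is the "enforced only at publish time" caveat from point 1: the check reflects the ISR size at the moment of the ack, and replicas can still fail immediately afterwards.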
[jira] [Updated] (KAFKA-1419) cross build for scala 2.11
[ https://issues.apache.org/jira/browse/KAFKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ivan Lyutov updated KAFKA-1419: --- Attachment: KAFKA-1419_2014-07-28_15:05:16.patch cross build for scala 2.11 -- Key: KAFKA-1419 URL: https://issues.apache.org/jira/browse/KAFKA-1419 Project: Kafka Issue Type: Improvement Components: clients Affects Versions: 0.8.1 Reporter: Scott Clasen Assignee: Ivan Lyutov Priority: Blocker Fix For: 0.8.2 Attachments: KAFKA-1419.patch, KAFKA-1419.patch, KAFKA-1419_2014-07-28_15:05:16.patch Please publish builds for scala 2.11, hopefully just needs a small tweak to the gradle conf? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (KAFKA-1419) cross build for scala 2.11
[ https://issues.apache.org/jira/browse/KAFKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076285#comment-14076285 ] Ivan Lyutov commented on KAFKA-1419: Updated reviewboard https://reviews.apache.org/r/23895/diff/ against branch apache/trunk
Re: Review Request 23648: Patch for KAFKA-1524
On July 22, 2014, 7:59 p.m., Joel Koshy wrote:

clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionContext.java, line 92
https://reviews.apache.org/r/23648/diff/1/?file=634391#file634391line92

What if the coordinator itself fails after a while and needs to be re-queried? Not sure if you intend for all of these to be covered in the subsequent work that addresses failure scenarios more thoroughly.

Yes, I will revisit this once we check all failure scenarios, as there are other changes we may need to make.

On July 22, 2014, 7:59 p.m., Joel Koshy wrote:

clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionContext.java, line 146
https://reviews.apache.org/r/23648/diff/1/?file=634391#file634391line146

I think it would be clearer to use a negated condition. i.e., if (txStatus != TransactionStatus.NOTRANSACTION)

There are more than 2 states. With a negated condition we would not allow starting a transaction when the status is ABORTED, for example.

On July 22, 2014, 7:59 p.m., Joel Koshy wrote:

clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionContext.java, line 158
https://reviews.apache.org/r/23648/diff/1/?file=634391#file634391line158

if (txStatus != TransactionStatus.ONGOING)

Similar to the previous one. In this case we would be allowing a committed transaction to be aborted.

On July 22, 2014, 7:59 p.m., Joel Koshy wrote:

clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionContext.java, line 170
https://reviews.apache.org/r/23648/diff/1/?file=634391#file634391line170

if (txStatus != TransactionStatus.ONGOING)

In this case, repeating a commit command would throw an exception. I'm not sure if this is what we want, as there might be some failure scenarios where we want to be able to repeat commits. Not sure about this; maybe I should wait until the failure handling is done and then check this again?
On July 22, 2014, 7:59 p.m., Joel Koshy wrote:

clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionContext.java, line 215
https://reviews.apache.org/r/23648/diff/1/?file=634391#file634391line215

Each check is a full iteration. Rather, can we just keep a count of pending messages for this check?

Yes, the iteration also removes messages. The idea is to remove them lazily when it is necessary to check if there are more to send. The alternative seems more complex, as we would need to attach a listener to check and remove received messages. Also, by keeping the remove logic in the same method that counts, we avoid the need to synchronize between the listener and this thread, i.e. it seems easier to return the exact count.

On July 22, 2014, 7:59 p.m., Joel Koshy wrote:

clients/src/main/java/org/apache/kafka/clients/NetworkClient.java, line 186
https://reviews.apache.org/r/23648/diff/1/?file=634383#file634383line186

One issue with this call is that it can block and add to the total time that poll needs to wait, which breaks the API contract of poll (i.e., it should return in at most timeout ms). Can we fold these into the poll below? Also (especially since the method is prefixed with maybe...) it would be cleaner to move the txContext null check inside the method.

Regarding poll, do you mean putting the maybeUpdate... methods inside poll? This would require passing at least two additional references. What if instead I measure the time taken to perform these maybeUpdate... calls and subtract it from the remaining timeout for poll? I think this should be accurate enough.

On July 22, 2014, 7:59 p.m., Joel Koshy wrote:

clients/src/main/java/org/apache/kafka/clients/NetworkClient.java, line 463
https://reviews.apache.org/r/23648/diff/1/?file=634383#file634383line463

Offset commits are only needed to redo aborted transactions - i.e., typically in a failure scenario - so this is not really required for the scope of this jira, right?
Furthermore, we would want the partition of the offsets topic to be part of the transaction itself, and the offset manager is not transaction-aware either.

Correct. I started sketching this and should not have included it in this patch. I will provide a complete version once we handle failure scenarios.

On July 22, 2014, 7:59 p.m., Joel Koshy wrote:

clients/src/main/java/org/apache/kafka/clients/producer/internals/TransactionContext.java, line 218
https://reviews.apache.org/r/23648/diff/1/?file=634391#file634391line218

May need to unnecessarily wait. i.e., who interrupts?

True. My intention was to check if all ACKs are received, otherwise wait a fraction of the total timeout. Instead of wait(remainingWaitMs) it would be something like wait(remainingWaitMs/4).

- Raul

--- This is an automatically generated e-mail. To reply, visit:
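The exchange about line 146 turns on a subtlety worth making concrete: with more than two transaction states, "error if status != NOTRANSACTION" is not equivalent to the original check. The enum and guards below are an illustration of that point only, not the patch's actual code; the exact condition in the patch is not shown in the thread, so the permissive variant here is an assumption based on the reply:

```java
// Why the reviewer's negated check changes behavior: with four states,
// "error if (txStatus != NOTRANSACTION)" also rejects starting a new
// transaction after an ABORTED one, which the reply says should be allowed.
enum TxStatus { NOTRANSACTION, ONGOING, COMMITTED, ABORTED }

class TxGuards {
    // Behavior defended in the reply (assumed): begin is legal from a
    // clean slate or after an abort.
    static boolean canBegin(TxStatus s) {
        return s == TxStatus.NOTRANSACTION || s == TxStatus.ABORTED;
    }

    // Reviewer's suggested form: begin is legal only when the status is
    // exactly NOTRANSACTION, so an ABORTED transaction blocks a new begin.
    static boolean canBeginNegated(TxStatus s) {
        return s == TxStatus.NOTRANSACTION;
    }
}
```

The two predicates agree on NOTRANSACTION and ONGOING but disagree on ABORTED, which is exactly the counterexample raised in the reply.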
[jira] [Commented] (KAFKA-1542) normal IOException in the new producer is logged as ERROR
[ https://issues.apache.org/jira/browse/KAFKA-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076479#comment-14076479 ] Jun Rao commented on KAFKA-1542:

This change introduced a bug. Saw the following when running ProducerFailureHandlingTest. The problem is that the remote InetAddress may not always be available (e.g., in connecting mode). Committed a follow-up patch.

{code}
java.lang.NullPointerException
at org.apache.kafka.common.network.Selector.poll(Selector.java:265)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:178)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
at java.lang.Thread.run(Thread.java:695)
{code}

normal IOException in the new producer is logged as ERROR
Key: KAFKA-1542
URL: https://issues.apache.org/jira/browse/KAFKA-1542
Project: Kafka
Issue Type: Bug
Affects Versions: 0.8.2
Reporter: Jun Rao
Assignee: David Corley
Labels: newbie
Fix For: 0.8.2
Attachments: KAFKA-1542.patch

Saw the following error in the log. It seems this can happen if the broker is down. So, this should probably be logged as WARN instead of ERROR.
{code}
2014/07/16 00:12:51.799 [Selector] Error in I/O:
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:60)
at org.apache.kafka.common.network.Selector.poll(Selector.java:241)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:171)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:174)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:114)
at java.lang.Thread.run(Thread.java:744)
{code}

-- This message was sent by Atlassian JIRA (v6.2#6252)
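The NullPointerException arises because a channel that is still connecting may not yet have a remote address to put in the log message. A small sketch of the guard the follow-up patch implies (the helper below is hypothetical, not Kafka's Selector code):

```java
import java.io.IOException;
import java.net.InetSocketAddress;

// Sketch of the idea behind the follow-up fix: when logging an I/O error,
// the remote address of a still-connecting channel may be unavailable, so
// fall back to a placeholder instead of dereferencing null.
class IoErrorLogging {
    static String describe(InetSocketAddress remote, IOException e) {
        String peer = (remote != null)
                ? remote.getHostString()
                : "unknown (still connecting)";
        return "I/O error with " + peer + ": " + e.getMessage();
    }
}
```

The same guard is what lets the exception be logged at WARN with useful context rather than crashing the poll loop with an NPE.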
[jira] [Commented] (KAFKA-1533) transient unit test failure in ProducerFailureHandlingTest
[ https://issues.apache.org/jira/browse/KAFKA-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076484#comment-14076484 ] Jun Rao commented on KAFKA-1533:

This seems to have been introduced by a bug in KAFKA-1542. I just committed a fix. After that, I don't see the unit test hanging issue any more. David, could you try the test again?

transient unit test failure in ProducerFailureHandlingTest
Key: KAFKA-1533
URL: https://issues.apache.org/jira/browse/KAFKA-1533
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 0.8.2
Reporter: Jun Rao
Assignee: Guozhang Wang
Fix For: 0.8.2
Attachments: KAFKA-1533.patch, KAFKA-1533.patch, KAFKA-1533.patch, KAFKA-1533_2014-07-21_15:45:58.patch, kafka.threads, stack.out

Occasionally, saw the test hang on tear down. The following is the stack trace:

{code}
Test worker prio=5 tid=7f9246956000 nid=0x10e078000 in Object.wait() [10e075000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on 7f4e69578 (a org.apache.zookeeper.ClientCnxn$Packet)
at java.lang.Object.wait(Object.java:485)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1344)
- locked 7f4e69578 (a org.apache.zookeeper.ClientCnxn$Packet)
at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:732)
at org.I0Itec.zkclient.ZkConnection.delete(ZkConnection.java:91)
at org.I0Itec.zkclient.ZkClient$8.call(ZkClient.java:720)
at org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:675)
at org.I0Itec.zkclient.ZkClient.delete(ZkClient.java:716)
at kafka.utils.ZkUtils$.deletePath(ZkUtils.scala:416)
at kafka.utils.ZkUtils$.deregisterBrokerInZk(ZkUtils.scala:184)
at kafka.server.KafkaHealthcheck.shutdown(KafkaHealthcheck.scala:50)
at kafka.server.KafkaServer$$anonfun$shutdown$2.apply$mcV$sp(KafkaServer.scala:243)
at kafka.utils.Utils$.swallow(Utils.scala:172)
at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
at kafka.utils.Utils$.swallowWarn(Utils.scala:45)
at kafka.utils.Logging$class.swallow(Logging.scala:94)
at kafka.utils.Utils$.swallow(Utils.scala:45)
at kafka.server.KafkaServer.shutdown(KafkaServer.scala:243)
at kafka.api.ProducerFailureHandlingTest.tearDown(ProducerFailureHandlingTest.scala:90)
{code}

-- This message was sent by Atlassian JIRA (v6.2#6252)
Re: Review Request 23767: Fix KAFKA-1430
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/23767/ --- (Updated July 28, 2014, 6:30 p.m.) Review request for kafka. Bugs: KAFKA-1430 https://issues.apache.org/jira/browse/KAFKA-1430 Repository: kafka Description (updated) --- Address Jun's comments: 1. Kept the first comment about removing readlock on leaderReplicaIfLocal for further discussion; Kept the comment on whether satisfying a delayed fetch immediately if on partition has an error for further discussion 3. Rebased on KAFKA-1542 follow-up; Diffs (updated) - core/src/main/scala/kafka/api/FetchRequest.scala 55a5982c5234f3ac10d4b4ea9fd7c4aa11d34154 core/src/main/scala/kafka/api/FetchResponse.scala d117f10f724b09d6deef0df3a138d28fc91aa13a core/src/main/scala/kafka/cluster/Partition.scala 134aef9c88068443d4d465189f376dd78605b4f8 core/src/main/scala/kafka/cluster/Replica.scala 5e659b4a5c0256431aecc200a6b914472da9ecf3 core/src/main/scala/kafka/consumer/ConsumerFetcherThread.scala f8c1b4e674f7515c377c6c30d212130f1ff022dd core/src/main/scala/kafka/consumer/SimpleConsumer.scala 0e64632210385ef63c2ad3445b55ac4f37a63df2 core/src/main/scala/kafka/log/Log.scala b7bc5ffdecba7221b3d2e4bca2b0c703bc74fa12 core/src/main/scala/kafka/log/LogCleaner.scala afbeffc72e7d7706b44961aecf8519c5c5a3b4b1 core/src/main/scala/kafka/log/LogSegment.scala 0d6926ea105a99c9ff2cfc9ea6440f2f2d37bde8 core/src/main/scala/kafka/server/AbstractFetcherThread.scala 3b15254f32252cf824d7a292889ac7662d73ada1 core/src/main/scala/kafka/server/DelayedFetch.scala PRE-CREATION core/src/main/scala/kafka/server/DelayedProduce.scala PRE-CREATION core/src/main/scala/kafka/server/DelayedRequestKey.scala PRE-CREATION core/src/main/scala/kafka/server/FetchDataInfo.scala PRE-CREATION core/src/main/scala/kafka/server/FetchRequestPurgatory.scala PRE-CREATION core/src/main/scala/kafka/server/KafkaApis.scala fd5f12ee31e78cdcda8e24a0ab3e1740b3928884 core/src/main/scala/kafka/server/LogOffsetMetadata.scala PRE-CREATION 
core/src/main/scala/kafka/server/OffsetManager.scala 0e22897cd1c7e45c58a61c3c468883611b19116d core/src/main/scala/kafka/server/ProducerRequestPurgatory.scala PRE-CREATION core/src/main/scala/kafka/server/ReplicaFetcherThread.scala 75ae1e161769a020a102009df416009bd6710f4a core/src/main/scala/kafka/server/ReplicaManager.scala 897783cb756de548a8b634876f729b63ffe9925e core/src/main/scala/kafka/server/RequestPurgatory.scala 3d0ff1e2dbd6a5c3587cffa283db70415af83a7b core/src/main/scala/kafka/tools/MetadataRequestProducer.scala PRE-CREATION core/src/main/scala/kafka/tools/ReplicaVerificationTool.scala af4783646803e58714770c21f8c3352370f26854 core/src/main/scala/kafka/tools/TestEndToEndLatency.scala 5f8f6bc46ae6cbbe15cec596ac99d0b377d1d7ef core/src/test/scala/other/kafka/StressTestLog.scala 8fcd068b248688c40e73117dc119fa84cceb95b3 core/src/test/scala/unit/kafka/api/RequestResponseSerializationTest.scala cd16ced5465d098be7a60498326b2a98c248f343 core/src/test/scala/unit/kafka/integration/PrimitiveApiTest.scala 9f04bd38be639cde3e7f402845dbe6ae92e87dc2 core/src/test/scala/unit/kafka/log/LogManagerTest.scala 7d4c70ce651b1af45cf9bb69b974aa770de8e59d core/src/test/scala/unit/kafka/log/LogSegmentTest.scala 6b7603728ae5217565d68b92dd5349e7c6508f31 core/src/test/scala/unit/kafka/log/LogTest.scala 1da1393983d4b20330e7c7f374424edd1b26f2a3 core/src/test/scala/unit/kafka/message/BaseMessageSetTestCases.scala 6db245c956d2172cde916defdb0749081bf891fd core/src/test/scala/unit/kafka/server/HighwatermarkPersistenceTest.scala e532c2826c96bfa47d9a3c41b4d71dbf69541eac core/src/test/scala/unit/kafka/server/ISRExpirationTest.scala 2cd3a3faf7be2bbc22a892eec78c6e4c05545e18 core/src/test/scala/unit/kafka/server/LogRecoveryTest.scala 0ec120a4a953114e88c575dd6b583874371a09e3 core/src/test/scala/unit/kafka/server/RequestPurgatoryTest.scala 4f61f8469df99e02d6ce7aad897d10e158cca8fd core/src/test/scala/unit/kafka/server/SimpleFetchTest.scala b1c4ce9f66aa86c68f2e6988f67546fa63ed1f9b Diff: 
https://reviews.apache.org/r/23767/diff/ Testing --- Thanks, Guozhang Wang
[jira] [Commented] (KAFKA-1430) Purgatory redesign
[ https://issues.apache.org/jira/browse/KAFKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076550#comment-14076550 ] Guozhang Wang commented on KAFKA-1430:

Updated reviewboard https://reviews.apache.org/r/23767/ against branch origin/trunk

Purgatory redesign
Key: KAFKA-1430
URL: https://issues.apache.org/jira/browse/KAFKA-1430
Project: Kafka
Issue Type: Improvement
Components: core
Affects Versions: 0.8.2
Reporter: Jun Rao
Assignee: Guozhang Wang
Attachments: KAFKA-1430.patch, KAFKA-1430.patch, KAFKA-1430.patch, KAFKA-1430.patch, KAFKA-1430.patch, KAFKA-1430_2014-06-05_17:07:37.patch, KAFKA-1430_2014-06-05_17:13:13.patch, KAFKA-1430_2014-06-05_17:40:53.patch, KAFKA-1430_2014-06-10_11:22:06.patch, KAFKA-1430_2014-06-10_11:26:02.patch, KAFKA-1430_2014-07-11_10:59:13.patch, KAFKA-1430_2014-07-21_12:53:39.patch, KAFKA-1430_2014-07-25_09:52:43.patch, KAFKA-1430_2014-07-28_11:30:23.patch

We have seen 2 main issues with the Purgatory.

1. There is no atomic checkAndWatch functionality. So, a client typically first checks whether a request is satisfied or not and then registers the watcher. However, by the time the watcher is registered, the registered item could already be satisfied. This item won't be satisfied until the next update happens or the delayed time expires, which means the watched item could be delayed.

2. FetchRequestPurgatory doesn't quite work. This is because the current design tries to incrementally maintain the accumulated bytes ready for fetch. However, this is difficult since the right time to check whether a fetch (for a regular consumer) request is satisfied is when the high watermark moves. At that point, it's hard to figure out how many bytes we should incrementally add to each pending fetch request. The problem has been reported in KAFKA-1150 and KAFKA-703.

-- This message was sent by Atlassian JIRA (v6.2#6252)
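Issue 1 above is a classic check-then-register race: an update can land between the satisfaction check and the watcher registration, leaving a satisfied request parked until expiration. The sketch below shows the shape of an atomic checkAndWatch, where both steps happen under one lock (names and structure are illustrative, not the actual purgatory code in the patch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Sketch of an atomic checkAndWatch: the satisfaction check and the watcher
// registration happen under a single lock, so an update delivered in between
// can no longer strand an already-satisfied request until it expires.
class PurgatorySketch {
    private final List<Runnable> watchers = new ArrayList<>();
    private final Object lock = new Object();

    /** Returns true if already satisfied; otherwise registers the watcher. */
    boolean checkAndWatch(BooleanSupplier satisfied, Runnable onSatisfied) {
        synchronized (lock) {
            if (satisfied.getAsBoolean()) {
                return true;            // caller completes immediately
            }
            watchers.add(onSatisfied);  // parked atomically w.r.t. updates
            return false;
        }
    }

    /** Called on every state change (e.g. when the high watermark moves). */
    void update() {
        synchronized (lock) {
            for (Runnable w : watchers) w.run();
            watchers.clear();
        }
    }

    int pending() {
        synchronized (lock) { return watchers.size(); }
    }
}
```

Because update() takes the same lock, there is no window in which a request has been checked but not yet watched, which is the gap described in issue 1.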
[jira] [Updated] (KAFKA-1430) Purgatory redesign
[ https://issues.apache.org/jira/browse/KAFKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guozhang Wang updated KAFKA-1430: Attachment: KAFKA-1430_2014-07-28_11:30:23.patch
Jenkins build is back to normal : Kafka-trunk #236
See https://builds.apache.org/job/Kafka-trunk/236/changes
Re: Review Request 23895: Patch for KAFKA-1419
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/23895/#review48877 --- Good stuff, I was able to get all binaries to build and pass tests with this patch. I think a fix to maintain compatibility with the BeanProperty change is important, but the other two aren't so much. build.gradle https://reviews.apache.org/r/23895/#comment85642 I think the file can be removed so that the exclusion isn't needed. core/src/main/scala/kafka/javaapi/message/ByteBufferMessageSet.scala https://reviews.apache.org/r/23895/#comment85644 you need to add `def getBuffer = buffer` here and below in order to maintain backwards compatibility (those are added by the @BeanProperty annotation). You could also do something similar to `kafka.utils.Annotations_2.9+.scala` and the groovy excludes to keep the BeanProperty annotation working. gradle.properties https://reviews.apache.org/r/23895/#comment85643 I hesitate to nit, but scala 2.11.2 was released 6 days ago. Might be worth switching if it doesn't cause any issues? - Joe Crobak On July 28, 2014, 3:07 p.m., Ivan Lyutov wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/23895/ --- (Updated July 28, 2014, 3:07 p.m.) Review request for kafka. 
[jira] Subscription: outstanding kafka patches
Issue Subscription
Filter: outstanding kafka patches (111 issues)
The list of outstanding kafka patches
Subscriber: kafka-mailing-list

Key Summary
KAFKA-1559 Upgrade Gradle wrapper to Gradle 2.0 https://issues.apache.org/jira/browse/KAFKA-1559
KAFKA-1550 Patch review tool should use git format-patch to generate patch https://issues.apache.org/jira/browse/KAFKA-1550
KAFKA-1543 Changing replication factor https://issues.apache.org/jira/browse/KAFKA-1543
KAFKA-1541 Add transactional request definitions to clients package https://issues.apache.org/jira/browse/KAFKA-1541
KAFKA-1536 Change the status of the JIRA to Patch Available in the kafka-review-tool https://issues.apache.org/jira/browse/KAFKA-1536
KAFKA-1528 Normalize all the line endings https://issues.apache.org/jira/browse/KAFKA-1528
KAFKA-1527 SimpleConsumer should be transaction-aware https://issues.apache.org/jira/browse/KAFKA-1527
KAFKA-1526 Producer performance tool should have an option to enable transactions https://issues.apache.org/jira/browse/KAFKA-1526
KAFKA-1525 DumpLogSegments should print transaction IDs https://issues.apache.org/jira/browse/KAFKA-1525
KAFKA-1524 Implement transactional producer https://issues.apache.org/jira/browse/KAFKA-1524
KAFKA-1523 Implement transaction manager module https://issues.apache.org/jira/browse/KAFKA-1523
KAFKA-1522 Transactional messaging request/response definitions https://issues.apache.org/jira/browse/KAFKA-1522
KAFKA-1517 Messages is a required argument to Producer Performance Test https://issues.apache.org/jira/browse/KAFKA-1517
KAFKA-1510 Force offset commits when migrating consumer offsets from zookeeper to kafka https://issues.apache.org/jira/browse/KAFKA-1510
KAFKA-1509 Restart of destination broker after unreplicated partition move leaves partitions without leader https://issues.apache.org/jira/browse/KAFKA-1509
KAFKA-1507 Using GetOffsetShell against non-existent topic creates the topic unintentionally https://issues.apache.org/jira/browse/KAFKA-1507
KAFKA-1500 adding new consumer requests using the new protocol https://issues.apache.org/jira/browse/KAFKA-1500
KAFKA-1498 new producer performance and bug improvements https://issues.apache.org/jira/browse/KAFKA-1498
KAFKA-1496 Using batch message in sync producer only sends the first message if we use a Scala Stream as the argument https://issues.apache.org/jira/browse/KAFKA-1496
KAFKA-1481 Stop using dashes AND underscores as separators in MBean names https://issues.apache.org/jira/browse/KAFKA-1481
KAFKA-1477 add authentication layer and initial JKS x509 implementation for brokers, producers and consumer for network communication https://issues.apache.org/jira/browse/KAFKA-1477
KAFKA-1476 Get a list of consumer groups https://issues.apache.org/jira/browse/KAFKA-1476
KAFKA-1475 Kafka consumer stops LeaderFinder/FetcherThreads, but application does not know https://issues.apache.org/jira/browse/KAFKA-1475
KAFKA-1471 Add Producer Unit Tests for LZ4 and LZ4HC compression https://issues.apache.org/jira/browse/KAFKA-1471
KAFKA-1468 Improve perf tests https://issues.apache.org/jira/browse/KAFKA-1468
KAFKA-1460 NoReplicaOnlineException: No replica for partition https://issues.apache.org/jira/browse/KAFKA-1460
KAFKA-1451 Broker stuck due to leader election race https://issues.apache.org/jira/browse/KAFKA-1451
KAFKA-1450 check invalid leader in a more robust way https://issues.apache.org/jira/browse/KAFKA-1450
KAFKA-1430 Purgatory redesign https://issues.apache.org/jira/browse/KAFKA-1430
KAFKA-1419 cross build for scala 2.11 https://issues.apache.org/jira/browse/KAFKA-1419
KAFKA-1394 Ensure last segment isn't deleted on expiration when there are unflushed messages https://issues.apache.org/jira/browse/KAFKA-1394
KAFKA-1372 Upgrade to Gradle 1.10 https://issues.apache.org/jira/browse/KAFKA-1372
KAFKA-1367 Broker topic metadata not kept in sync with ZooKeeper https://issues.apache.org/jira/browse/KAFKA-1367
KAFKA-1351 String.format is very expensive in Scala https://issues.apache.org/jira/browse/KAFKA-1351
KAFKA-1343 Kafka consumer iterator thread stalls https://issues.apache.org/jira/browse/KAFKA-1343
KAFKA-1330 Implement subscribe(TopicPartition...partitions) in the new consumer https://issues.apache.org/jira/browse/KAFKA-1330
KAFKA-1329 Add metadata fetch and refresh functionality to the consumer https://issues.apache.org/jira/browse/KAFKA-1329
KAFKA-1324 Debian packaging https://issues.apache.org/jira/browse/KAFKA-1324
KAFKA-1303 metadata
[jira] [Commented] (KAFKA-1414) Speedup broker startup after hard reset
[ https://issues.apache.org/jira/browse/KAFKA-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076660#comment-14076660 ] ASF GitHub Bot commented on KAFKA-1414: --- Github user ataraxer closed the pull request at: https://github.com/apache/kafka/pull/26

Speedup broker startup after hard reset
---
Key: KAFKA-1414
URL: https://issues.apache.org/jira/browse/KAFKA-1414
Project: Kafka
Issue Type: Improvement
Components: log
Affects Versions: 0.8.1.1, 0.8.2, 0.9.0
Reporter: Dmitry Bugaychenko
Assignee: Anton Karamanov
Fix For: 0.8.2
Attachments: 0001-KAFKA-1414-Speedup-broker-startup-after-hard-reset-a.patch, KAFKA-1414-rev1.patch, KAFKA-1414-rev2.fixed.patch, KAFKA-1414-rev2.patch, KAFKA-1414-rev3-interleaving.patch, KAFKA-1414-rev3.patch, KAFKA-1414-rev4.patch, KAFKA-1414-rev5.patch, freebie.patch, parallel-dir-loading-0.8.patch, parallel-dir-loading-trunk-fixed-threadpool.patch, parallel-dir-loading-trunk-threadpool.patch, parallel-dir-loading-trunk.patch

After a hard reset due to power failure, the broker takes far too much time recovering unflushed segments in a single thread. This could be easily improved by launching multiple threads (one per data directory, assuming that typically each data directory is on a dedicated drive). Locally we tried this simple patch to LogManager.loadLogs and it seems to work; however, I'm too new to Scala, so do not take it literally:

{code}
/**
 * Recover and load all logs in the given data directories
 */
private def loadLogs(dirs: Seq[File]) {
  val threads: Array[Thread] = new Array[Thread](dirs.size)
  var i: Int = 0
  val me = this
  for (dir <- dirs) {
    val thread = new Thread(new Runnable {
      def run() {
        val recoveryPoints = me.recoveryPointCheckpoints(dir).read
        /* load the logs */
        val subDirs = dir.listFiles()
        if (subDirs != null) {
          val cleanShutDownFile = new File(dir, Log.CleanShutdownFile)
          if (cleanShutDownFile.exists())
            info("Found clean shutdown file. Skipping recovery for all logs in data directory '%s'".format(dir.getAbsolutePath))
          for (subDir <- subDirs) {
            if (subDir.isDirectory) {
              info("Loading log '" + subDir.getName + "'")
              val topicPartition = Log.parseTopicPartitionName(subDir.getName)
              val config = topicConfigs.getOrElse(topicPartition.topic, defaultConfig)
              val log = new Log(subDir, config, recoveryPoints.getOrElse(topicPartition, 0L), scheduler, time)
              val previous = addLogWithLock(topicPartition, log)
              if (previous != null)
                throw new IllegalArgumentException("Duplicate log directories found: %s, %s!".format(log.dir.getAbsolutePath, previous.dir.getAbsolutePath))
            }
          }
          cleanShutDownFile.delete()
        }
      }
    })
    thread.start()
    threads(i) = thread
    i = i + 1
  }
  for (thread <- threads) {
    thread.join()
  }
}

def addLogWithLock(topicPartition: TopicAndPartition, log: Log): Log = {
  logCreationOrDeletionLock synchronized {
    this.logs.put(topicPartition, log)
  }
}
{code}

-- This message was sent by Atlassian JIRA (v6.2#6252)
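The later attachments on this ticket (parallel-dir-loading-trunk-threadpool.patch, parallel-dir-loading-trunk-fixed-threadpool.patch) suggest the hand-rolled threads were eventually replaced with a pool. A minimal Java sketch of that shape: one recovery task per data directory, joined at the end. Here `recoverLogsIn` is a hypothetical stand-in for the real per-directory recovery work, not Kafka code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLogLoader {
    // Recover all data directories concurrently, one task per directory
    // (mirroring the one-thread-per-dir patch), then wait for them all.
    public static List<String> loadLogs(List<String> dataDirs) {
        ExecutorService pool = Executors.newFixedThreadPool(dataDirs.size());
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String dir : dataDirs) {
                futures.add(pool.submit(() -> recoverLogsIn(dir)));
            }
            List<String> recovered = new ArrayList<>();
            for (Future<String> f : futures) {
                recovered.add(f.get()); // equivalent of thread.join(); rethrows task failures
            }
            return recovered;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Hypothetical stand-in for the real per-directory work: reading the
    // recovery-point checkpoint and loading each topic-partition log.
    static String recoverLogsIn(String dir) {
        return "recovered:" + dir;
    }
}
```

Bounding the pool to one worker per data directory keeps the patch's assumption that each directory sits on its own drive, so concurrent recovery tasks mostly contend for different disks.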
[jira] [Commented] (KAFKA-1555) provide strong consistency with reasonable availability
[ https://issues.apache.org/jira/browse/KAFKA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076758#comment-14076758 ] saurabh agarwal commented on KAFKA-1555:

"this seems more like the ack=1 case in Kafka. Basically, if you have R replicas and you set dfs.replication.min to a value smaller than R, you can get an ack back sooner. However, there is a slight possibility that the acked data is lost with fewer than R-1 failures"

Agree on the part that ack=1 is similar to dfs.replication.min. It is actually also the same as the proposed min.isr.required. What I am suggesting is to pair it with acks=-1, which is similar to dfs.replication. In this case, when the system is running normally, all writes will get acked by all replicas (enforced by acks=-1). When replicas are going down one by one, min.isr will ensure that a minimum number of replicas remain.

provide strong consistency with reasonable availability
---
Key: KAFKA-1555
URL: https://issues.apache.org/jira/browse/KAFKA-1555
Project: Kafka
Issue Type: Improvement
Components: controller
Affects Versions: 0.8.1.1
Reporter: Jiang Wu
Assignee: Neha Narkhede

In a mission critical application, we expect a kafka cluster with 3 brokers to satisfy two requirements:
1. When 1 broker is down, no message loss or service blocking happens.
2. In worse cases, such as when two brokers are down, service can be blocked, but no message loss happens.
We found that the current kafka version (0.8.1.1) cannot achieve these requirements due to three of its behaviors:
1. When choosing a new leader from 2 followers in ISR, the one with fewer messages may be chosen as the leader.
2. Even when replica.lag.max.messages=0, a follower can stay in ISR when it has fewer messages than the leader.
3. ISR can contain only 1 broker, therefore acknowledged messages may be stored in only 1 broker.
The following is an analytical proof.
We consider a cluster with 3 brokers and a topic with 3 replicas, and assume that at the beginning, all 3 replicas, leader A, followers B and C, are in sync, i.e., they have the same messages and are all in ISR. According to the value of request.required.acks (acks for short), there are the following cases. 1. acks=0, 1, 3. Obviously these settings do not satisfy the requirement. 2. acks=2. Producer sends a message m. It's acknowledged by A and B. At this time, although C hasn't received m, C is still in ISR. If A is killed, C can be elected as the new leader, and consumers will miss m. 3. acks=-1. B and C restart and are removed from ISR. Producer sends a message m to A, and receives an acknowledgement. Disk failure happens in A before B and C replicate m. Message m is lost. In summary, any existing configuration cannot satisfy the requirements. -- This message was sent by Atlassian JIRA (v6.2#6252)
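The case analysis above can be checked mechanically with a toy model (an illustrative sketch, not Kafka code): a produce is acknowledged once enough replicas hold the message (all current ISR members for acks=-1), and an acknowledged message survives a failover only if the newly elected leader actually stored it, since followers truncate to the new leader's log:

```java
import java.util.Set;

public class AckModel {
    // A produce with request.required.acks = k is acknowledged once k replicas
    // hold the message; acks = -1 waits for every replica currently in the ISR.
    static boolean acked(int acks, int replicasWithMessage, Set<Integer> isr) {
        if (acks == -1) {
            return replicasWithMessage >= isr.size();
        }
        return replicasWithMessage >= acks;
    }

    // Followers truncate to the new leader's log on failover, so an
    // acknowledged message survives only if the elected leader stored it.
    static boolean lostAfterFailover(Set<Integer> brokersHoldingMessage, int newLeader) {
        return !brokersHoldingMessage.contains(newLeader);
    }
}
```

In this model, case 2 (acks=2, message on A and B, leader A killed, C elected) and case 3 (acks=-1 with the ISR shrunk to A alone) both come out as acknowledged-then-lost, matching the proof.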
Review Request 24006: Patch for KAFKA-1420
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24006/ --- Review request for kafka. Bugs: KAFKA-1420 https://issues.apache.org/jira/browse/KAFKA-1420 Repository: kafka Description --- KAFKA-1420 Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in all unit tests Diffs - core/src/test/scala/unit/kafka/admin/AdminTest.scala e28979827110dfbbb92fe5b152e7f1cc973de400 core/src/test/scala/unit/kafka/admin/DeleteTopicTest.scala 29cc01bcef9cacd8dec1f5d662644fc6fe4994bc core/src/test/scala/unit/kafka/integration/UncleanLeaderElectionTest.scala f44568cb25edf25db857415119018fd4c9922f61 core/src/test/scala/unit/kafka/utils/TestUtils.scala c4e13c5240c8303853d08cc3b40088f8c7dae460 Diff: https://reviews.apache.org/r/24006/diff/ Testing --- Thanks, Jonathan Natkins
Review Request 24008: Patch for KAFKA-1420
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24008/ --- Review request for kafka. Bugs: KAFKA-1420 https://issues.apache.org/jira/browse/KAFKA-1420 Repository: kafka Description --- KAFKA-1420 Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in all unit tests Diffs - core/src/test/scala/unit/kafka/admin/AdminTest.scala e28979827110dfbbb92fe5b152e7f1cc973de400 core/src/test/scala/unit/kafka/admin/DeleteTopicTest.scala 29cc01bcef9cacd8dec1f5d662644fc6fe4994bc core/src/test/scala/unit/kafka/integration/UncleanLeaderElectionTest.scala f44568cb25edf25db857415119018fd4c9922f61 core/src/test/scala/unit/kafka/utils/TestUtils.scala c4e13c5240c8303853d08cc3b40088f8c7dae460 Diff: https://reviews.apache.org/r/24008/diff/ Testing --- Thanks, Jonathan Natkins
Re: Review Request 24006: Patch for KAFKA-1420
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24006/ --- (Updated July 28, 2014, 8:52 p.m.) Review request for kafka. Bugs: KAFKA-1420 https://issues.apache.org/jira/browse/KAFKA-1420 Repository: kafka Description --- KAFKA-1420 Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in all unit tests Diffs - core/src/test/scala/unit/kafka/admin/AdminTest.scala e28979827110dfbbb92fe5b152e7f1cc973de400 core/src/test/scala/unit/kafka/admin/DeleteTopicTest.scala 29cc01bcef9cacd8dec1f5d662644fc6fe4994bc core/src/test/scala/unit/kafka/integration/UncleanLeaderElectionTest.scala f44568cb25edf25db857415119018fd4c9922f61 core/src/test/scala/unit/kafka/utils/TestUtils.scala c4e13c5240c8303853d08cc3b40088f8c7dae460 Diff: https://reviews.apache.org/r/24006/diff/ Testing (updated) --- Automated Thanks, Jonathan Natkins
[jira] [Updated] (KAFKA-1420) Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in unit tests
[ https://issues.apache.org/jira/browse/KAFKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Natkins updated KAFKA-1420: Attachment: KAFKA-1420.patch Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in unit tests -- Key: KAFKA-1420 URL: https://issues.apache.org/jira/browse/KAFKA-1420 Project: Kafka Issue Type: Bug Reporter: Guozhang Wang Labels: newbie Fix For: 0.8.2 Attachments: KAFKA-1420.patch This is a follow-up JIRA from KAFKA-1389. There are a bunch of places in the unit tests where we misuse AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK to create topics, where TestUtils.createTopic needs to be used instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
Review Request 24007: Patch for KAFKA-1420
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24007/ --- Review request for kafka. Bugs: KAFKA-1420 https://issues.apache.org/jira/browse/KAFKA-1420 Repository: kafka Description --- KAFKA-1420 Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in all unit tests Diffs - core/src/test/scala/unit/kafka/admin/AdminTest.scala e28979827110dfbbb92fe5b152e7f1cc973de400 core/src/test/scala/unit/kafka/admin/DeleteTopicTest.scala 29cc01bcef9cacd8dec1f5d662644fc6fe4994bc core/src/test/scala/unit/kafka/integration/UncleanLeaderElectionTest.scala f44568cb25edf25db857415119018fd4c9922f61 core/src/test/scala/unit/kafka/utils/TestUtils.scala c4e13c5240c8303853d08cc3b40088f8c7dae460 Diff: https://reviews.apache.org/r/24007/diff/ Testing --- Thanks, Jonathan Natkins
[jira] [Commented] (KAFKA-1420) Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in unit tests
[ https://issues.apache.org/jira/browse/KAFKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076838#comment-14076838 ] Guozhang Wang commented on KAFKA-1420: -- Thanks for the patch, will review it asap. Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in unit tests -- Key: KAFKA-1420 URL: https://issues.apache.org/jira/browse/KAFKA-1420 Project: Kafka Issue Type: Bug Reporter: Guozhang Wang Labels: newbie Fix For: 0.8.2 Attachments: KAFKA-1420.patch This is a follow-up JIRA from KAFKA-1389. There are a bunch of places in the unit tests where we misuse AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK to create topics, where TestUtils.createTopic needs to be used instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (KAFKA-1420) Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in unit tests
[ https://issues.apache.org/jira/browse/KAFKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Natkins updated KAFKA-1420: Status: Patch Available (was: Open) Patch available at https://reviews.apache.org/r/24006/ Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in unit tests -- Key: KAFKA-1420 URL: https://issues.apache.org/jira/browse/KAFKA-1420 Project: Kafka Issue Type: Bug Reporter: Guozhang Wang Labels: newbie Fix For: 0.8.2 This is a follow-up JIRA from KAFKA-1389. There are a bunch of places in the unit tests where we misuse AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK to create topics, where TestUtils.createTopic needs to be used instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (KAFKA-1420) Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in unit tests
[ https://issues.apache.org/jira/browse/KAFKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Natkins updated KAFKA-1420: Attachment: (was: KAFKA-1420.patch) Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in unit tests -- Key: KAFKA-1420 URL: https://issues.apache.org/jira/browse/KAFKA-1420 Project: Kafka Issue Type: Bug Reporter: Guozhang Wang Labels: newbie Fix For: 0.8.2 Attachments: KAFKA-1420.patch This is a follow-up JIRA from KAFKA-1389. There are a bunch of places in the unit tests where we misuse AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK to create topics, where TestUtils.createTopic needs to be used instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (KAFKA-1420) Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in unit tests
[ https://issues.apache.org/jira/browse/KAFKA-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Natkins updated KAFKA-1420: Attachment: KAFKA-1420.patch Replace AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK with TestUtils.createTopic in unit tests -- Key: KAFKA-1420 URL: https://issues.apache.org/jira/browse/KAFKA-1420 Project: Kafka Issue Type: Bug Reporter: Guozhang Wang Labels: newbie Fix For: 0.8.2 Attachments: KAFKA-1420.patch This is a follow-up JIRA from KAFKA-1389. There are a bunch of places in the unit tests where we misuse AdminUtils.createOrUpdateTopicPartitionAssignmentPathInZK to create topics, where TestUtils.createTopic needs to be used instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
Review Request 24012: Patch for KAFKA-1560
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/24012/ --- Review request for kafka. Bugs: KAFKA-1560 https://issues.apache.org/jira/browse/KAFKA-1560 Repository: kafka Description --- KAFKA-1560 Make arguments to jira-python API more explicit in kafka-patch-review's get_jira() Diffs - kafka-patch-review.py dc45549f886440f1721c60aab9aa0a4af9b4cbef Diff: https://reviews.apache.org/r/24012/diff/ Testing --- Thanks, Jonathan Natkins
[jira] [Updated] (KAFKA-1560) Make arguments to jira-python API more explicit in kafka-patch-review's get_jira()
[ https://issues.apache.org/jira/browse/KAFKA-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Natkins updated KAFKA-1560: Status: Patch Available (was: Open) Make arguments to jira-python API more explicit in kafka-patch-review's get_jira() --- Key: KAFKA-1560 URL: https://issues.apache.org/jira/browse/KAFKA-1560 Project: Kafka Issue Type: Improvement Reporter: Jonathan Natkins Priority: Minor Attachments: KAFKA-1560.patch I ran into an issue with kafka-patch-review.py as a result of a change in parameters for the JIRA() object between jira-python 0.28 and 0.29. A quick fix is to make parameters more explicit. As a side note, there is a separate bug affecting the review tool, but I'm not sure it's fixable without modifying the jira-python source (https://bitbucket.org/bspeakmon/jira-python/issue/122/clientissue-function-needs-param-removal) It seems the answer is that people should not use jira-python 0.29 yet, since it's a little broken. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (KAFKA-1560) Make arguments to jira-python API more explicit in kafka-patch-review's get_jira()
[ https://issues.apache.org/jira/browse/KAFKA-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076957#comment-14076957 ] Jonathan Natkins commented on KAFKA-1560: - Created reviewboard https://reviews.apache.org/r/24012/diff/ against branch origin/trunk Make arguments to jira-python API more explicit in kafka-patch-review's get_jira() --- Key: KAFKA-1560 URL: https://issues.apache.org/jira/browse/KAFKA-1560 Project: Kafka Issue Type: Improvement Reporter: Jonathan Natkins Priority: Minor Attachments: KAFKA-1560.patch I ran into an issue with kafka-patch-review.py as a result of a change in parameters for the JIRA() object between jira-python 0.28 and 0.29. A quick fix is to make parameters more explicit. As a side note, there is a separate bug affecting the review tool, but I'm not sure it's fixable without modifying the jira-python source (https://bitbucket.org/bspeakmon/jira-python/issue/122/clientissue-function-needs-param-removal) It seems the answer is that people should not use jira-python 0.29 yet, since it's a little broken. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: Review Request 23339: Patch for KAFKA-1507
On July 28, 2014, 3:16 a.m., Jun Rao wrote:

clients/src/main/java/org/apache/kafka/clients/producer/internals/Metadata.java, lines 56-57
https://reviews.apache.org/r/23339/diff/3/?file=641061#file641061line56

We need to pass in the createTopic flag in this constructor too. In the producer, we will then set createTopic to true, and in the consumer, we will set it to false.

I added another constructor, public Metadata(long refreshBackoffMs, long metadataExpireMs, boolean createTopic), and kept this one for backward compatibility; if it causes confusion I'll remove it. I was thinking that if the user is calling Metadata() from the producer, createTopic should be transparent to them, as they expect it to work since it's part of the broker config by default.

- Sriharsha

--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/23339/#review48822 ---

On July 24, 2014, 12:07 a.m., Sriharsha Chintalapani wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/23339/ --- (Updated July 24, 2014, 12:07 a.m.) Review request for kafka. Bugs: KAFKA-1507 https://issues.apache.org/jira/browse/KAFKA-1507 Repository: kafka Description --- KAFKA-1507. Using GetOffsetShell against non-existent topic creates the topic unintentionally. Changes as per Jun's suggestions.
Diffs - clients/src/main/java/org/apache/kafka/clients/NetworkClient.java d8f9ce663ee24d2b0852c974136741280c39f8f8 clients/src/main/java/org/apache/kafka/clients/producer/internals/Metadata.java 4aa5b01d611631db72df47d50bbe30edb8c478db clients/src/main/java/org/apache/kafka/common/protocol/Protocol.java 7517b879866fc5dad5f8d8ad30636da8bbe7784a clients/src/main/java/org/apache/kafka/common/protocol/types/Struct.java 444e69e7c95d5ffad19896fff0ab15cb4f5c9b4e clients/src/main/java/org/apache/kafka/common/requests/MetadataRequest.java b22ca1dce65f665d84c2a980fd82f816e93d9960 clients/src/test/java/org/apache/kafka/common/requests/RequestResponseTest.java df37fc6d8f0db0b8192a948426af603be3444da4 core/src/main/scala/kafka/api/TopicMetadataRequest.scala 7dca09ce637a40e125de05703dc42e8b611971ac core/src/main/scala/kafka/client/ClientUtils.scala ce7ede3f6d60e756e252257bd8c6fedc21f21e1c core/src/main/scala/kafka/consumer/ConsumerFetcherManager.scala b9e2bea7b442a19bcebd1b350d39541a8c9dd068 core/src/main/scala/kafka/javaapi/TopicMetadataRequest.scala b0b7be14d494ae8c87f4443b52db69d273c20316 core/src/main/scala/kafka/server/KafkaApis.scala fd5f12ee31e78cdcda8e24a0ab3e1740b3928884 core/src/main/scala/kafka/tools/GetOffsetShell.scala 9c6064e201eebbcd5b276a0dedd02937439edc94 core/src/main/scala/kafka/tools/ReplicaVerificationTool.scala af4783646803e58714770c21f8c3352370f26854 core/src/main/scala/kafka/tools/SimpleConsumerShell.scala 36314f412a8281aece2789fd2b74a106b82c57d2 core/src/test/scala/unit/kafka/admin/AddPartitionsTest.scala 1bf2667f47853585bc33ffb3e28256ec5f24ae84 core/src/test/scala/unit/kafka/integration/TopicMetadataTest.scala 35dc071b1056e775326981573c9618d8046e601d Diff: https://reviews.apache.org/r/23339/diff/ Testing --- Thanks, Sriharsha Chintalapani
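The overload-for-compatibility approach Sriharsha describes, keeping the two-argument constructor and delegating to a new three-argument one, looks roughly like this. This is a sketch of the pattern only, not the contents of the real Metadata.java; the `allowTopicCreation` accessor is a hypothetical name for illustration:

```java
// Sketch of the constructor-overloading pattern from the review thread.
// Field and accessor names are illustrative, not the real Metadata class.
public class Metadata {
    private final long refreshBackoffMs;
    private final long metadataExpireMs;
    private final boolean createTopic;

    // Existing two-argument constructor kept for backward compatibility;
    // it preserves the old behavior of allowing topic auto-creation.
    public Metadata(long refreshBackoffMs, long metadataExpireMs) {
        this(refreshBackoffMs, metadataExpireMs, true);
    }

    // New constructor proposed in the review: the producer would pass
    // createTopic = true, while tools and consumers would pass false.
    public Metadata(long refreshBackoffMs, long metadataExpireMs, boolean createTopic) {
        this.refreshBackoffMs = refreshBackoffMs;
        this.metadataExpireMs = metadataExpireMs;
        this.createTopic = createTopic;
    }

    public boolean allowTopicCreation() {
        return createTopic;
    }
}
```

Defaulting the old constructor to createTopic = true preserves the existing auto-create behavior for callers that have not been updated, while tools like GetOffsetShell can opt out explicitly.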
[jira] [Created] (KAFKA-1561) Data Loss for Incremented Replica Factor and Leader Election
Guozhang Wang created KAFKA-1561: Summary: Data Loss for Incremented Replica Factor and Leader Election Key: KAFKA-1561 URL: https://issues.apache.org/jira/browse/KAFKA-1561 Project: Kafka Issue Type: Bug Reporter: Guozhang Wang Assignee: Guozhang Wang Fix For: 0.8.2 This is reported on the mailing list (thanks to Jad). quote Hi, I have a test that continuously sends messages to one broker, brings up another broker, and adds it as a replica for all partitions, with it being the preferred replica for some. I have auto.leader.rebalance.enable=true, so replica election gets triggered. Data is being pumped to the old broker all the while. It seems that some data gets lost while switching over to the new leader. Is this a bug, or do I have something misconfigured? I also have request.required.acks=-1 on the producer. Here's what I think is happening: 1. Producer writes message to broker 0, [EventServiceUpsertTopic,13], w/ broker 0 currently leader, with ISR=(0), so write returns successfully, even when acks = -1. Correlation id 35836 Producer log: [2014-07-24 14:44:26,991] [DEBUG] [dw-97 - PATCH /v1/events/type_for_test_bringupNewBroker_shouldRebalance_shouldNotLoseData/event?_idPath=idField_mergeFields=field1] [kafka.producer.BrokerPartitionInfo] Partition [EventServiceUpsertTopic,13] has leader 0 [2014-07-24 14:44:26,993] [DEBUG] [dw-97 - PATCH /v1/events/type_for_test_bringupNewBroker_shouldRebalance_shouldNotLoseData/event?_idPath=idField_mergeFields=field1] [k.producer.async.DefaultEventHandler] Producer sent messages with correlation id 35836 for topics [EventServiceUpsertTopic,13] to broker 0 on localhost:56821 2. Broker 1 is still catching up Broker 0 Log: [2014-07-24 14:44:26,992] [DEBUG] [kafka-request-handler-3] [kafka.cluster.Partition] Partition [EventServiceUpsertTopic,13] on broker 0: Old hw for partition [EventServiceUpsertTopic,13] is 971. New hw is 971. 
All leo's are 975,971 [2014-07-24 14:44:26,992] [DEBUG] [kafka-request-handler-3] [kafka.server.KafkaApis] [KafkaApi-0] Produce to local log in 0 ms [2014-07-24 14:44:26,992] [DEBUG] [kafka-processor-56821-0] [kafka.request.logger] Completed request:Name: ProducerRequest; Version: 0; CorrelationId: 35836; ClientId: ; RequiredAcks: -1; AckTimeoutMs: 1 ms from client /127.0.0.1:57086 ;totalTime:0,requestQueueTime:0,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0 3. Leader election is triggered by the scheduler: Broker 0 Log: [2014-07-24 14:44:26,991] [INFO ] [kafka-scheduler-0] [k.c.PreferredReplicaPartitionLeaderSelector] [PreferredReplicaPartitionLeaderSelector]: Current leader 0 for partition [ EventServiceUpsertTopic,13] is not the preferred replica. Trigerring preferred replica leader election [2014-07-24 14:44:26,993] [DEBUG] [kafka-scheduler-0] [kafka.utils.ZkUtils$] Conditional update of path /brokers/topics/EventServiceUpsertTopic/partitions/13/state with value {controller_epoch:1,leader:1,version:1,leader_epoch:3,isr:[0,1]} and expected version 3 succeeded, returning the new version: 4 [2014-07-24 14:44:26,994] [DEBUG] [kafka-scheduler-0] [k.controller.PartitionStateMachine] [Partition state machine on Controller 0]: After leader election, leader cache is updated to Map(Snipped(Leader:1,ISR:0,1,LeaderEpoch:3,ControllerEpoch:1),EndSnip) [2014-07-24 14:44:26,994] [INFO ] [kafka-scheduler-0] [kafka.controller.KafkaController] [Controller 0]: Partition [ EventServiceUpsertTopic,13] completed preferred replica leader election. New leader is 1 4. Broker 1 is still behind, but it sets the high water mark to 971!!! 
Broker 1 Log: [2014-07-24 14:44:26,999] [INFO ] [kafka-request-handler-6] [kafka.server.ReplicaFetcherManager] [ReplicaFetcherManager on broker 1] Removed fetcher for partitions [EventServiceUpsertTopic,13] [2014-07-24 14:44:27,000] [DEBUG] [kafka-request-handler-6] [kafka.cluster.Partition] Partition [EventServiceUpsertTopic,13] on broker 1: Old hw for partition [EventServiceUpsertTopic,13] is 970. New hw is -1. All leo's are -1,971 [2014-07-24 14:44:27,098] [DEBUG] [kafka-request-handler-3] [kafka.server.KafkaApis] [KafkaApi-1] Maybe update partition HW due to fetch request: Name: FetchRequest; Version: 0; CorrelationId: 1; ClientId: ReplicaFetcherThread-0-1; ReplicaId: 0; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [EventServiceUpsertTopic,13] - PartitionFetchInfo(971,1048576), Snipped [2014-07-24 14:44:27,098] [DEBUG] [kafka-request-handler-3] [kafka.cluster.Partition] Partition [EventServiceUpsertTopic,13] on broker 1: Recording follower 0 position 971 for partition [ EventServiceUpsertTopic,13]. [2014-07-24 14:44:27,100] [DEBUG] [kafka-request-handler-3] [kafka.cluster.Partition] Partition [EventServiceUpsertTopic,13] on broker 1: Highwatermark for partition [EventServiceUpsertTopic,13] updated to 971
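The high-watermark movements in steps 2-4 follow from how the leader derives the HW: it is the minimum log-end-offset (LEO) across the ISR, so it can sit below the leader's own LEO while a follower catches up. A toy sketch of that calculation (illustrative, not broker code):

```java
import java.util.Map;
import java.util.Set;

public class HighWatermark {
    // The leader advances the partition HW to the smallest log-end-offset
    // among the in-sync replicas: everything below the HW is known to be
    // present on every ISR member.
    static long leaderHighWatermark(Map<Integer, Long> leoByReplica, Set<Integer> isr) {
        long hw = Long.MAX_VALUE;
        for (int replica : isr) {
            hw = Math.min(hw, leoByReplica.getOrDefault(replica, 0L));
        }
        return hw == Long.MAX_VALUE ? 0L : hw;
    }
}
```

With LEOs of 975 (broker 0) and 971 (broker 1) and both in the ISR, the HW stays at 971 even though the old leader has offsets up to 975; after broker 1 takes over as leader, offsets 972-975 sit above the HW it knows about, which is exactly the range the consumer never sees.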
[jira] [Updated] (KAFKA-1561) Data Loss for Incremented Replica Factor and Leader Election
[ https://issues.apache.org/jira/browse/KAFKA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guozhang Wang updated KAFKA-1561: - Description: This is reported on the mailing list (thanks to Jad). {quote} Hi, I have a test that continuously sends messages to one broker, brings up another broker, and adds it as a replica for all partitions, with it being the preferred replica for some. I have auto.leader.rebalance.enable=true, so replica election gets triggered. Data is being pumped to the old broker all the while. It seems that some data gets lost while switching over to the new leader. Is this a bug, or do I have something misconfigured? I also have request.required.acks=-1 on the producer. Here's what I think is happening: 1. Producer writes message to broker 0, [EventServiceUpsertTopic,13], w/ broker 0 currently leader, with ISR=(0), so write returns successfully, even when acks = -1. Correlation id 35836 Producer log: [2014-07-24 14:44:26,991] [DEBUG] [dw-97 - PATCH /v1/events/type_for_test_bringupNewBroker_shouldRebalance_shouldNotLoseData/event?_idPath=idField_mergeFields=field1] [kafka.producer.BrokerPartitionInfo] Partition [EventServiceUpsertTopic,13] has leader 0 [2014-07-24 14:44:26,993] [DEBUG] [dw-97 - PATCH /v1/events/type_for_test_bringupNewBroker_shouldRebalance_shouldNotLoseData/event?_idPath=idField_mergeFields=field1] [k.producer.async.DefaultEventHandler] Producer sent messages with correlation id 35836 for topics [EventServiceUpsertTopic,13] to broker 0 on localhost:56821 2. Broker 1 is still catching up Broker 0 Log: [2014-07-24 14:44:26,992] [DEBUG] [kafka-request-handler-3] [kafka.cluster.Partition] Partition [EventServiceUpsertTopic,13] on broker 0: Old hw for partition [EventServiceUpsertTopic,13] is 971. New hw is 971. 
All leo's are 975,971 [2014-07-24 14:44:26,992] [DEBUG] [kafka-request-handler-3] [kafka.server.KafkaApis] [KafkaApi-0] Produce to local log in 0 ms [2014-07-24 14:44:26,992] [DEBUG] [kafka-processor-56821-0] [kafka.request.logger] Completed request:Name: ProducerRequest; Version: 0; CorrelationId: 35836; ClientId: ; RequiredAcks: -1; AckTimeoutMs: 1 ms from client /127.0.0.1:57086 ;totalTime:0,requestQueueTime:0,localTime:0,remoteTime:0,responseQueueTime:0,sendTime:0 3. Leader election is triggered by the scheduler: Broker 0 Log: [2014-07-24 14:44:26,991] [INFO ] [kafka-scheduler-0] [k.c.PreferredReplicaPartitionLeaderSelector] [PreferredReplicaPartitionLeaderSelector]: Current leader 0 for partition [ EventServiceUpsertTopic,13] is not the preferred replica. Trigerring preferred replica leader election [2014-07-24 14:44:26,993] [DEBUG] [kafka-scheduler-0] [kafka.utils.ZkUtils$] Conditional update of path /brokers/topics/EventServiceUpsertTopic/partitions/13/state with value {controller_epoch:1,leader:1,version:1,leader_epoch:3,isr:[0,1]} and expected version 3 succeeded, returning the new version: 4 [2014-07-24 14:44:26,994] [DEBUG] [kafka-scheduler-0] [k.controller.PartitionStateMachine] [Partition state machine on Controller 0]: After leader election, leader cache is updated to Map(Snipped(Leader:1,ISR:0,1,LeaderEpoch:3,ControllerEpoch:1),EndSnip) [2014-07-24 14:44:26,994] [INFO ] [kafka-scheduler-0] [kafka.controller.KafkaController] [Controller 0]: Partition [ EventServiceUpsertTopic,13] completed preferred replica leader election. New leader is 1 4. Broker 1 is still behind, but it sets the high water mark to 971!!! 
Broker 1 Log: [2014-07-24 14:44:26,999] [INFO ] [kafka-request-handler-6] [kafka.server.ReplicaFetcherManager] [ReplicaFetcherManager on broker 1] Removed fetcher for partitions [EventServiceUpsertTopic,13] [2014-07-24 14:44:27,000] [DEBUG] [kafka-request-handler-6] [kafka.cluster.Partition] Partition [EventServiceUpsertTopic,13] on broker 1: Old hw for partition [EventServiceUpsertTopic,13] is 970. New hw is -1. All leo's are -1,971 [2014-07-24 14:44:27,098] [DEBUG] [kafka-request-handler-3] [kafka.server.KafkaApis] [KafkaApi-1] Maybe update partition HW due to fetch request: Name: FetchRequest; Version: 0; CorrelationId: 1; ClientId: ReplicaFetcherThread-0-1; ReplicaId: 0; MaxWait: 500 ms; MinBytes: 1 bytes; RequestInfo: [EventServiceUpsertTopic,13] - PartitionFetchInfo(971,1048576), Snipped [2014-07-24 14:44:27,098] [DEBUG] [kafka-request-handler-3] [kafka.cluster.Partition] Partition [EventServiceUpsertTopic,13] on broker 1: Recording follower 0 position 971 for partition [ EventServiceUpsertTopic,13]. [2014-07-24 14:44:27,100] [DEBUG] [kafka-request-handler-3] [kafka.cluster.Partition] Partition [EventServiceUpsertTopic,13] on broker 1: Highwatermark for partition [EventServiceUpsertTopic,13] updated to 971 5. Consumer is none the wiser. All data that was in offsets 972-975 doesn't show up! I tried this with 2 initial replicas, and adding a 3rd which is supposed to be the leader for some new partitions,
[jira] [Commented] (KAFKA-1507) Using GetOffsetShell against non-existent topic creates the topic unintentionally
[ https://issues.apache.org/jira/browse/KAFKA-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077182#comment-14077182 ] Jay Kreps commented on KAFKA-1507: -- I kind of feel this is not the right fix. I think the behavior we have is kind of indefensible. The history was that we previously had topic auto-creation that happened when a message was sent to a broker and that broker didn't host the topic. However, when we moved to replication this broke for multiple reasons: the producer needed metadata about the topic to send the message, and not all brokers would have all partitions. However, this auto-create behavior is very useful. So we kind of just grandfathered it in by having the metadata request have the same side effect. But this is really a silly, indefensible behavior. There is no reason that asking for metadata should create topics! This would be like a database where running DESCRIBE TABLE X would create table X for you if it didn't exist. This just confuses everyone, as it must have confused you. In any case the auto-creation behavior is very limited because there is no way to specify the number of partitions, the replication factor, or any topic-specific configuration. Rather than further enshrining this behavior by starting to add topic creation options to the metadata request, I really think we should add a proper create_topic API and have producers use that. We could even make this a little more general and handle create, alter, and delete. This would give clients full control.
Using GetOffsetShell against non-existent topic creates the topic unintentionally
--
Key: KAFKA-1507
URL: https://issues.apache.org/jira/browse/KAFKA-1507
Project: Kafka
Issue Type: Bug
Affects Versions: 0.8.1.1
Environment: centos
Reporter: Luke Forehand
Assignee: Sriharsha Chintalapani
Priority: Minor
Labels: newbie
Attachments: KAFKA-1507.patch, KAFKA-1507_2014-07-22_10:27:45.patch, KAFKA-1507_2014-07-23_17:07:20.patch

A typo in using the GetOffsetShell command can cause a topic to be created which cannot be deleted (because deletion is still in progress):

./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list kafka10:9092,kafka11:9092,kafka12:9092,kafka13:9092 --topic typo --time 1

./kafka-topics.sh --zookeeper stormqa1/kafka-prod --describe --topic typo
Topic:typo  PartitionCount:8  ReplicationFactor:1  Configs:
  Topic: typo  Partition: 0  Leader: 10  Replicas: 10  Isr: 10
...

-- This message was sent by Atlassian JIRA (v6.2#6252)
Re: Review Request 23895: Patch for KAFKA-1419
---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/23895/#review48957
---

build.gradle
https://reviews.apache.org/r/23895/#comment85795

    I think these two dependencies need to be updated. As I was trying out this code, I got some strange compile errors [1] in a downstream project. This seems to work better:

    + compile 'org.scala-lang.modules:scala-xml_2.11:1.0.2'
    + compile 'org.scala-lang.modules:scala-parser-combinators_2.11:1.0.2'

    That error also makes me wonder if those compile dependencies are actually needed, too.

    [1] java.lang.RuntimeException: Conflicting cross-version suffixes in: org.scala-lang.modules:scala-xml, org.scala-lang.modules:scala-parser-combinators

- Joe Crobak

On July 28, 2014, 3:07 p.m., Ivan Lyutov wrote:

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/23895/
---

(Updated July 28, 2014, 3:07 p.m.)

Review request for kafka.

Bugs: KAFKA-1419
https://issues.apache.org/jira/browse/KAFKA-1419

Repository: kafka

Description
---
KAFKA-1419
- cross build for scala 2.11
- dropped scala 2.8 support
- minor bug fixes

Diffs
---
build.gradle a72905df824ba68bed5d5170d18873c23e1782c9
core/src/main/scala/kafka/javaapi/message/ByteBufferMessageSet.scala fecee8d5f7b32f483bb1bfc6a5080d589906f9c4
core/src/main/scala/kafka/message/ByteBufferMessageSet.scala 73401c5ff34d08abce22267aa9c4d86632c6fb74
gradle.properties 4827769a3f8e34f0fe7e783eb58e44d4db04859b
gradle/buildscript.gradle 225e0a82708bc5f390e5e2c1d4d9a0d06f491b95
gradle/wrapper/gradle-wrapper.properties 610282a699afc89a82203ef0e4e71ecc53761039
scala.gradle ebd21b870c0746aade63248344ab65d9b5baf820

Diff: https://reviews.apache.org/r/23895/diff/

Testing
---

Thanks,
Ivan Lyutov
[jira] [Commented] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077385#comment-14077385 ]

Manikumar Reddy commented on KAFKA-1451:

Updated reviewboard https://reviews.apache.org/r/23962/diff/ against branch origin/trunk

Broker stuck due to leader election race
--
Key: KAFKA-1451
URL: https://issues.apache.org/jira/browse/KAFKA-1451
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 0.8.1.1
Reporter: Maciek Makowski
Assignee: Manikumar Reddy
Priority: Minor
Labels: newbie
Attachments: KAFKA-1451.patch, KAFKA-1451_2014-07-28_20:27:32.patch, KAFKA-1451_2014-07-29_10:13:23.patch

h3. Symptoms

The broker does not become available because it is stuck in an infinite loop while electing a leader. This can be recognised by the following line being repeatedly written to server.log:

{code}
[2014-05-14 04:35:09,187] INFO I wrote this conflicted ephemeral node [{version:1,brokerid:1,timestamp:1400060079108}] at /controller a while back in a different session, hence I will backoff for this node to be deleted by Zookeeper and retry (kafka.utils.ZkUtils$)
{code}

h3. Steps to Reproduce

In a setup with a single Kafka 0.8.1.1 node and a single ZooKeeper 3.4.6 node (though it will likely behave the same with the ZK version included in the Kafka distribution):

# start both zookeeper and kafka (in any order)
# stop zookeeper
# stop kafka
# start kafka
# start zookeeper

h3. Likely Cause

{{ZookeeperLeaderElector}} subscribes to data changes on startup, and then triggers an election. If the deletion of the ephemeral {{/controller}} node associated with the broker's previous ZooKeeper session happens after the subscription to changes in the new session, the election will be invoked twice, once from {{startup}} and once from {{handleDataDeleted}}:

* {{startup}}: acquire {{controllerLock}}
* {{startup}}: subscribe to data changes
* zookeeper: delete {{/controller}} since the session that created it timed out
* {{handleDataDeleted}}: {{/controller}} was deleted
* {{handleDataDeleted}}: wait on {{controllerLock}}
* {{startup}}: elect -- writes {{/controller}}
* {{startup}}: release {{controllerLock}}
* {{handleDataDeleted}}: acquire {{controllerLock}}
* {{handleDataDeleted}}: elect -- attempts to write {{/controller}} and then gets into an infinite loop as a result of the conflict

{{createEphemeralPathExpectConflictHandleZKBug}} assumes that the existing znode was written from a different session, which is not true in this case; it was written from the same session. That adds to the confusion.

h3. Suggested Fix

In {{ZookeeperLeaderElector.startup}}, first run {{elect}} and then subscribe to data changes.

-- This message was sent by Atlassian JIRA (v6.2#6252)
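The interleaving listed above can be played out in a toy model (plain Python, not the actual Scala code; `Elector`, `zk_deletes_stale_node`, and the sequential call ordering are constructed for illustration, with the lock standing in for {{controllerLock}}):

```python
import threading

class Elector:
    """Toy model of the double-election interleaving described above."""
    def __init__(self):
        self.controller_lock = threading.Lock()
        self.znode = None          # stands in for the /controller znode
        self.elect_calls = []

    def elect(self, caller):
        self.elect_calls.append(caller)
        if self.znode is None:
            self.znode = "broker-1"
        # else: conflict -- in the real code this is where
        # createEphemeralPathExpectConflictHandleZKBug backs off forever,
        # because the conflicting znode came from the *same* session.

    def startup(self):
        with self.controller_lock:
            # 1. subscribe to data changes (watch registered)
            # 2. ZooKeeper deletes the stale /controller from the old
            #    session; the watch fires but queues on controller_lock
            self.elect("startup")            # 3. first election wins
        self.handle_data_deleted()           # 4. queued watcher runs

    def handle_data_deleted(self):
        with self.controller_lock:
            self.elect("handleDataDeleted")  # 5. second election: conflict

e = Elector()
e.startup()
assert e.elect_calls == ["startup", "handleDataDeleted"]  # elected twice
assert e.znode == "broker-1"   # znode written by the *same* broker/session
```

The suggested fix maps onto this model directly: if `startup` ran `elect` before registering the watch, the stale-node deletion would be observed only once.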
Re: Review Request 23962: Patch for KAFKA-1451
On July 29, 2014, 4:08 a.m., Jun Rao wrote:

    core/src/main/scala/kafka/server/ZookeeperLeaderElector.scala, lines 71-72
    https://reviews.apache.org/r/23962/diff/3/?file=643587#file643587line71

    I think we need to first update leaderId here based on the value returned from getControllerID(), and then check if the controller exists. Otherwise, if another broker is the controller, this broker's leaderId won't be updated correctly.

Initially I also thought of updating the leaderId field, but since we use the leaderId field only in the amILeader method, I thought it was OK not to update it. Anyhow, I will upload a new patch with the correct leaderId update.

- Manikumar Reddy

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/23962/#review48965
---

On July 29, 2014, 4:45 a.m., Manikumar Reddy O wrote:

---
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/23962/
---

(Updated July 29, 2014, 4:45 a.m.)

Review request for kafka.

Bugs: KAFKA-1451
https://issues.apache.org/jira/browse/KAFKA-1451

Repository: kafka

Description
---
Addressing Jun's comments

Diffs
---
core/src/main/scala/kafka/server/ZookeeperLeaderElector.scala e5b6ff1e2544b043007cf16a6b9dd4451c839e63

Diff: https://reviews.apache.org/r/23962/diff/

Testing
---

Thanks,
Manikumar Reddy O
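The ordering Jun asks for can be sketched in a toy model (plain Python, not the actual Scala; `LeaderElector`, the dict standing in for ZooKeeper, and the method names are illustrative): refresh `leaderId` from the stored controller value before deciding whether a controller already exists, so that `amILeader` stays correct even when another broker won the election.

```python
class LeaderElector:
    def __init__(self, broker_id, zk):
        self.broker_id = broker_id
        self.zk = zk               # dict standing in for ZooKeeper
        self.leader_id = -1

    def get_controller_id(self):
        return self.zk.get("/controller", -1)

    def elect(self):
        # First update leader_id from the stored value...
        self.leader_id = self.get_controller_id()
        # ...then check whether a controller already exists. Done in the
        # other order, leader_id would stay stale when another broker won.
        if self.leader_id != -1:
            return self.am_i_leader()
        self.zk["/controller"] = self.broker_id
        self.leader_id = self.broker_id
        return True

    def am_i_leader(self):
        return self.leader_id == self.broker_id

zk = {"/controller": 2}            # broker 2 is already the controller
e = LeaderElector(broker_id=1, zk=zk)
assert e.elect() is False          # broker 1 loses the election...
assert e.leader_id == 2            # ...and correctly records the winner
```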
[jira] [Updated] (KAFKA-1451) Broker stuck due to leader election race
[ https://issues.apache.org/jira/browse/KAFKA-1451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manikumar Reddy updated KAFKA-1451:
---
Attachment: KAFKA-1451_2014-07-29_10:13:23.patch

Broker stuck due to leader election race
--
Key: KAFKA-1451
URL: https://issues.apache.org/jira/browse/KAFKA-1451

-- This message was sent by Atlassian JIRA (v6.2#6252)