[jira] [Commented] (KAFKA-2293) IllegalFormatConversionException in Partition.scala

2015-06-22 Thread Aditya Auradkar (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596298#comment-14596298
 ] 

Aditya Auradkar commented on KAFKA-2293:


[~junrao] Can you take a look at this minor fix?

 IllegalFormatConversionException in Partition.scala
 ---

 Key: KAFKA-2293
 URL: https://issues.apache.org/jira/browse/KAFKA-2293
 Project: Kafka
  Issue Type: Bug
Reporter: Aditya Auradkar
Assignee: Aditya Auradkar
 Attachments: KAFKA-2293.patch


 ERROR [KafkaApis] [kafka-request-handler-9] [kafka-server] [] [KafkaApi-306] 
 error when handling request Name: 
 java.util.IllegalFormatConversionException: d != 
 kafka.server.LogOffsetMetadata
 at 
 java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:4302)
 at 
 java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2793)
 at java.util.Formatter$FormatSpecifier.print(Formatter.java:2747)
 at java.util.Formatter.format(Formatter.java:2520)
 at java.util.Formatter.format(Formatter.java:2455)
 at java.lang.String.format(String.java:2925)
 at 
 scala.collection.immutable.StringLike$class.format(StringLike.scala:266)
 at scala.collection.immutable.StringOps.format(StringOps.scala:31)
 at 
 kafka.cluster.Partition.updateReplicaLogReadResult(Partition.scala:253)
 at 
 kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:791)
 at 
 kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:788)
 at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
 at 
 kafka.server.ReplicaManager.updateFollowerLogReadResults(ReplicaManager.scala:788)
 at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:433)
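
To illustrate the failure mode (a sketch, not the actual Partition.scala code): a %d conversion requires a numeric argument, so formatting an arbitrary object with it throws exactly this exception, while %s falls back to toString() and is the usual fix.

```java
// Minimal reproduction of the exception above; LogOffsetMetadata here is a
// stand-in class, not the real kafka.server.LogOffsetMetadata.
public class FormatDemo {
    static final class LogOffsetMetadata {
        @Override public String toString() { return "[offset 42, segment base 0]"; }
    }

    public static void main(String[] args) {
        LogOffsetMetadata metadata = new LogOffsetMetadata();
        System.out.println(String.format("hw: %s", metadata)); // ok: uses toString()
        System.out.println(String.format("hw: %d", metadata)); // throws IllegalFormatConversionException: d != FormatDemo$LogOffsetMetadata
    }
}
```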



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 34789: Patch for KAFKA-2168

2015-06-22 Thread Guozhang Wang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34789/#review88796
---

Ship it!


I did not review it thoroughly but the design looks clean to me. Great work!

I think we can check it in to unblock other JIRAs, and come back to it when 
necessary in the future for any follow-up work.


clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerWakeupException.java
 (lines 15 - 18)
https://reviews.apache.org/r/34789/#comment141379

We would like to have a serialVersionUID for any classes extending 
Serializable. You can take a look at ConfigException for example.
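
A minimal sketch of the requested change (the field value here is arbitrary; any fixed long works, it just pins the serialized form across builds):

```java
// Sketch only: adds the serialVersionUID the review asks for.
public class ConsumerWakeupException extends RuntimeException {
    private static final long serialVersionUID = 1L;
}
```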


- Guozhang Wang


On June 19, 2015, 4:19 p.m., Jason Gustafson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34789/
 ---
 
 (Updated June 19, 2015, 4:19 p.m.)
 
 
 Review request for kafka.
 
 
 Bugs: KAFKA-2168
 https://issues.apache.org/jira/browse/KAFKA-2168
 
 
 Repository: kafka
 
 
 Description
 ---
 
 KAFKA-2168; refactored callback handling to prevent unnecessary requests
 
 
 KAFKA-2168; address review comments
 
 
 KAFKA-2168; fix rebase error and checkstyle issue
 
 
 KAFKA-2168; address review comments and add docs
 
 
 KAFKA-2168; handle polling with timeout 0
 
 
 KAFKA-2168; timeout=0 means return immediately
 
 
 KAFKA-2168; address review comments
 
 
 KAFKA-2168; address more review comments
 
 
 KAFKA-2168; updated for review comments
 
 
 Diffs
 -
 
   clients/src/main/java/org/apache/kafka/clients/consumer/Consumer.java 
 8f587bc0705b65b3ef37c86e0c25bb43ab8803de 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerRecords.java 
 1ca75f83d3667f7d01da1ae2fd9488fb79562364 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerWakeupException.java
  PRE-CREATION 
   clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java 
 951c34c92710fc4b38d656e99d2a41255c60aeb7 
   clients/src/main/java/org/apache/kafka/clients/consumer/MockConsumer.java 
 f50da825756938c193d7f07bee953e000e2627d9 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/OffsetResetStrategy.java
  PRE-CREATION 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/internals/Coordinator.java
  41cb9458f51875ac9418fce52f264b35adba92f4 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java
  56281ee15cc33dfc96ff64d5b1e596047c7132a4 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/internals/Heartbeat.java
  e7cfaaad296fa6e325026a5eee1aaf9b9c0fe1fe 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/internals/RequestFuture.java
  PRE-CREATION 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java
  cee75410127dd1b86c1156563003216d93a086b3 
   clients/src/main/java/org/apache/kafka/common/utils/Utils.java 
 f73eedb030987f018d8446bb1dcd98d19fa97331 
   
 clients/src/test/java/org/apache/kafka/clients/consumer/MockConsumerTest.java 
 677edd385f35d4262342b567262c0b874876d25b 
   
 clients/src/test/java/org/apache/kafka/clients/consumer/internals/CoordinatorTest.java
  1454ab73df22cce028f41f74b970628829da4e9d 
   
 clients/src/test/java/org/apache/kafka/clients/consumer/internals/FetcherTest.java
  419541011d652becf0cda7a5e62ce813cddb1732 
   
 clients/src/test/java/org/apache/kafka/clients/consumer/internals/HeartbeatTest.java
  ecc78cedf59a994fcf084fa7a458fe9ed5386b00 
   
 clients/src/test/java/org/apache/kafka/clients/consumer/internals/SubscriptionStateTest.java
  e000cf8e10ebfacd6c9ee68d7b88ff8c157f73c6 
   clients/src/test/java/org/apache/kafka/common/utils/UtilsTest.java 
 2ebe3c21f611dc133a2dbb8c7dfb0845f8c21498 
 
 Diff: https://reviews.apache.org/r/34789/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Jason Gustafson
 




[jira] [Updated] (KAFKA-2288) Follow-up to KAFKA-2249 - reduce logging and testing

2015-06-22 Thread Gwen Shapira (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gwen Shapira updated KAFKA-2288:

Attachment: KAFKA-2288_2015-06-22_10:02:27.patch

 Follow-up to KAFKA-2249 - reduce logging and testing
 

 Key: KAFKA-2288
 URL: https://issues.apache.org/jira/browse/KAFKA-2288
 Project: Kafka
  Issue Type: Bug
Reporter: Gwen Shapira
Assignee: Gwen Shapira
 Attachments: KAFKA-2288.patch, KAFKA-2288_2015-06-22_10:02:27.patch


 As [~junrao] commented on KAFKA-2249, we have a needless test and we are 
 logging configuration for every single partition now. 
 Lets reduce the noise.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 35677: Patch for KAFKA-2288

2015-06-22 Thread Gwen Shapira

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35677/
---

(Updated June 22, 2015, 5:02 p.m.)


Review request for kafka.


Bugs: KAFKA-2288
https://issues.apache.org/jira/browse/KAFKA-2288


Repository: kafka


Description (updated)
---

removed logging of topic overrides


Diffs (updated)
-

  clients/src/main/java/org/apache/kafka/common/config/AbstractConfig.java 
bae528d31516679bed88ee61b408f209f185a8cc 
  core/src/main/scala/kafka/log/LogConfig.scala 
fc41132d2bf29439225ec581829eb479f98cc416 
  core/src/test/scala/unit/kafka/log/LogConfigTest.scala 
19dcb47f3f406b8d6c3668297450ab6b534e4471 
  core/src/test/scala/unit/kafka/server/KafkaConfigConfigDefTest.scala 
98a5b042a710d3c1064b0379db1d152efc9eabee 
  core/src/test/scala/unit/kafka/server/KafkaConfigTest.scala 
2428dbd7197a58cf4cad42ef82b385dab3a2b15e 

Diff: https://reviews.apache.org/r/35677/diff/


Testing
---


Thanks,

Gwen Shapira



[jira] [Commented] (KAFKA-2293) IllegalFormatConversionException in Partition.scala

2015-06-22 Thread Aditya A Auradkar (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596296#comment-14596296
 ] 

Aditya A Auradkar commented on KAFKA-2293:
--

Created reviewboard https://reviews.apache.org/r/35734/diff/
 against branch origin/trunk

 IllegalFormatConversionException in Partition.scala
 ---

 Key: KAFKA-2293
 URL: https://issues.apache.org/jira/browse/KAFKA-2293
 Project: Kafka
  Issue Type: Bug
Reporter: Aditya Auradkar
Assignee: Aditya Auradkar
 Attachments: KAFKA-2293.patch


 ERROR [KafkaApis] [kafka-request-handler-9] [kafka-server] [] [KafkaApi-306] 
 error when handling request Name: 
 java.util.IllegalFormatConversionException: d != 
 kafka.server.LogOffsetMetadata
 at 
 java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:4302)
 at 
 java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2793)
 at java.util.Formatter$FormatSpecifier.print(Formatter.java:2747)
 at java.util.Formatter.format(Formatter.java:2520)
 at java.util.Formatter.format(Formatter.java:2455)
 at java.lang.String.format(String.java:2925)
 at 
 scala.collection.immutable.StringLike$class.format(StringLike.scala:266)
 at scala.collection.immutable.StringOps.format(StringOps.scala:31)
 at 
 kafka.cluster.Partition.updateReplicaLogReadResult(Partition.scala:253)
 at 
 kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:791)
 at 
 kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:788)
 at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
 at 
 kafka.server.ReplicaManager.updateFollowerLogReadResults(ReplicaManager.scala:788)
 at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:433)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-2235) LogCleaner offset map overflow

2015-06-22 Thread Jun Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-2235.

   Resolution: Fixed
Fix Version/s: 0.8.3
 Assignee: Ivan Simoneko  (was: Jay Kreps)

Thanks for patch v2. +1. Committed to trunk after changing the following 
statement from testing > to >=.

  if (map.size + segmentSize >= maxDesiredMapSize)


 LogCleaner offset map overflow
 --

 Key: KAFKA-2235
 URL: https://issues.apache.org/jira/browse/KAFKA-2235
 Project: Kafka
  Issue Type: Bug
  Components: core, log
Affects Versions: 0.8.1, 0.8.2.0
Reporter: Ivan Simoneko
Assignee: Ivan Simoneko
 Fix For: 0.8.3

 Attachments: KAFKA-2235_v1.patch, KAFKA-2235_v2.patch


 We've seen log cleaning generating an error for a topic with lots of small 
 messages. It seems that cleanup map overflow is possible if a log segment 
 contains more unique keys than empty slots in offsetMap. Checking baseOffset 
 and map utilization before processing a segment is not enough because it 
 doesn't take the segment size (the number of unique messages in the 
 segment) into account.
 I suggest estimating the upper bound of keys in a segment as the number of 
 messages in the segment and comparing it with the number of available slots in 
 the map (keeping in mind the desired load factor). This works in cases where 
 an empty map is capable of holding all the keys for a single segment. If even a 
 single segment is not able to fit into an empty map, the cleanup process will 
 still fail. Probably there should be a limit on the log segment entries count?
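
A rough sketch of the proposed guard (names are illustrative, not the actual kafka.log.LogCleaner code):

```java
// Illustrative only; names are hypothetical, not the actual LogCleaner code.
public class OffsetMapCheck {
    // Bound the segment's unique-key count by its message count and test it
    // against the map's remaining capacity at the desired load factor.
    static boolean segmentFits(int currentMapSize, int segmentMessageCount,
                               int mapSlots, double desiredLoadFactor) {
        int maxDesiredMapSize = (int) (mapSlots * desiredLoadFactor);
        // Complement of the committed guard: the map counts as full once
        // map.size + segmentSize >= maxDesiredMapSize.
        return currentMapSize + segmentMessageCount < maxDesiredMapSize;
    }
}
```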
 Here is the stack trace for this error:
 2015-05-19 16:52:48,758 ERROR [kafka-log-cleaner-thread-0] 
 kafka.log.LogCleaner - [kafka-log-cleaner-thread-0], Error due to
 java.lang.IllegalArgumentException: requirement failed: Attempt to add a new 
 entry to a full offset map.
at scala.Predef$.require(Predef.scala:233)
at kafka.log.SkimpyOffsetMap.put(OffsetMap.scala:79)
at 
 kafka.log.Cleaner$$anonfun$kafka$log$Cleaner$$buildOffsetMapForSegment$1.apply(LogCleaner.scala:543)
at 
 kafka.log.Cleaner$$anonfun$kafka$log$Cleaner$$buildOffsetMapForSegment$1.apply(LogCleaner.scala:538)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at kafka.utils.IteratorTemplate.foreach(IteratorTemplate.scala:32)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at kafka.message.MessageSet.foreach(MessageSet.scala:67)
at 
 kafka.log.Cleaner.kafka$log$Cleaner$$buildOffsetMapForSegment(LogCleaner.scala:538)
at 
 kafka.log.Cleaner$$anonfun$buildOffsetMap$3.apply(LogCleaner.scala:515)
at 
 kafka.log.Cleaner$$anonfun$buildOffsetMap$3.apply(LogCleaner.scala:512)
at scala.collection.immutable.Stream.foreach(Stream.scala:547)
at kafka.log.Cleaner.buildOffsetMap(LogCleaner.scala:512)
at kafka.log.Cleaner.clean(LogCleaner.scala:307)
at 
 kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:221)
at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:199)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2293) IllegalFormatConversionException in Partition.scala

2015-06-22 Thread Aditya A Auradkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aditya A Auradkar updated KAFKA-2293:
-
Attachment: KAFKA-2293.patch

 IllegalFormatConversionException in Partition.scala
 ---

 Key: KAFKA-2293
 URL: https://issues.apache.org/jira/browse/KAFKA-2293
 Project: Kafka
  Issue Type: Bug
Reporter: Aditya Auradkar
Assignee: Aditya Auradkar
 Attachments: KAFKA-2293.patch


 ERROR [KafkaApis] [kafka-request-handler-9] [kafka-server] [] [KafkaApi-306] 
 error when handling request Name: 
 java.util.IllegalFormatConversionException: d != 
 kafka.server.LogOffsetMetadata
 at 
 java.util.Formatter$FormatSpecifier.failConversion(Formatter.java:4302)
 at 
 java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:2793)
 at java.util.Formatter$FormatSpecifier.print(Formatter.java:2747)
 at java.util.Formatter.format(Formatter.java:2520)
 at java.util.Formatter.format(Formatter.java:2455)
 at java.lang.String.format(String.java:2925)
 at 
 scala.collection.immutable.StringLike$class.format(StringLike.scala:266)
 at scala.collection.immutable.StringOps.format(StringOps.scala:31)
 at 
 kafka.cluster.Partition.updateReplicaLogReadResult(Partition.scala:253)
 at 
 kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:791)
 at 
 kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:788)
 at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
 at 
 kafka.server.ReplicaManager.updateFollowerLogReadResults(ReplicaManager.scala:788)
 at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:433)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-2290) OffsetIndex should open RandomAccessFile consistently

2015-06-22 Thread Jun Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jun Rao resolved KAFKA-2290.

   Resolution: Fixed
Fix Version/s: 0.8.3

Thanks for the patch. +1 and committed to trunk.

 OffsetIndex should open RandomAccessFile consistently
 -

 Key: KAFKA-2290
 URL: https://issues.apache.org/jira/browse/KAFKA-2290
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 0.9.0
Reporter: Jun Rao
Assignee: Chris Black
  Labels: newbie
 Fix For: 0.8.3

 Attachments: KAFKA-2290.patch


 We open RandomAccessFile in rw mode in the constructor, but in rws mode 
 in resize(). We should use rw in both cases since it's more efficient.
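
For context, the two modes differ in standard java.io behavior (shown below as a sketch, not the OffsetIndex code):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// "rw"  : read/write; flushing is left to the OS or an explicit force().
// "rws" : every write synchronously flushes file content and metadata,
//         which is much slower, hence the preference for "rw" here.
public class ModeDemo {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile index = new RandomAccessFile("index.tmp", "rw")) {
            index.setLength(8 * 1024); // resize without a per-write fsync
        }
    }
}
```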



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2276) Initial patch for KIP-25

2015-06-22 Thread Geoffrey Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596624#comment-14596624
 ] 

Geoffrey Anderson commented on KAFKA-2276:
--

Pull request is here: https://github.com/apache/kafka/pull/70

 Initial patch for KIP-25
 

 Key: KAFKA-2276
 URL: https://issues.apache.org/jira/browse/KAFKA-2276
 Project: Kafka
  Issue Type: Bug
Reporter: Geoffrey Anderson
Assignee: Geoffrey Anderson

 Submit initial patch for KIP-25 
 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-25+-+System+test+improvements)
 This patch should contain a few Service classes and a few tests which can 
 serve as examples 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [GitHub] kafka pull request: Kafka 2276

2015-06-22 Thread Geoffrey Anderson
Hi,

I'm pinging the dev list regarding KAFKA-2276 (KIP-25 initial patch) again
since it sounds like at least one person I spoke with did not see the
initial pull request.

Pull request: https://github.com/apache/kafka/pull/70/
JIRA: https://issues.apache.org/jira/browse/KAFKA-2276

Thanks!
Geoff


On Tue, Jun 16, 2015 at 2:50 PM, granders g...@git.apache.org wrote:

 GitHub user granders opened a pull request:

 https://github.com/apache/kafka/pull/70

 Kafka 2276

 Initial patch for KIP-25

 Note that to install ducktape, do *not* use pip. Instead:

 ```
 $ git clone g...@github.com:confluentinc/ducktape.git
 $ cd ducktape
 $ python setup.py install
 ```


 You can merge this pull request into a Git repository by running:

 $ git pull https://github.com/confluentinc/kafka KAFKA-2276

 Alternatively you can review and apply these changes as the patch at:

 https://github.com/apache/kafka/pull/70.patch

 To close this pull request, make a commit to your master/trunk branch
 with (at least) the following in the commit message:

 This closes #70

 
 commit 81e41562f3836e95e89e12f215c82b1b2d505381
 Author: Liquan Pei liquan...@gmail.com
 Date:   2015-04-24T01:32:54Z

 Bootstrap Kafka system tests

 commit f1914c3ba9b52d0f8db3989c8b031127b42ac59e
 Author: Liquan Pei liquan...@gmail.com
 Date:   2015-04-24T01:33:44Z

 Merge pull request #2 from confluentinc/system_tests

 Bootstrap Kafka system tests

 commit a2789885806f98dcd1fd58edc9a10a30e4bd314c
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-05-26T22:21:23Z

 fixed typos

 commit 07cd1c66a952ee29fc3c8e85464acb43a6981b8a
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-05-26T22:22:14Z

 Added simple producer which prints status of produced messages to
 stdout.

 commit da94b8cbe79e6634cc32fbe8f6deb25388923029
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-05-27T21:07:20Z

 Added number of messages option.

 commit 212b39a2d75027299fbb1b1008d463a82aab
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-05-27T22:35:06Z

 Added some metadata to producer output.

 commit 8b4b1f2aa9681632ef65aa92dfd3066cd7d62851
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-05-29T23:38:32Z

 Minor updates to VerboseProducer

 commit c0526fe44cea739519a0889ebe9ead01b406b365
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-01T02:27:15Z

 Updates per review comments.

 commit bc009f218e00241cbdd23931d01b52c442eef6b7
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-01T02:28:28Z

 Got rid of VerboseProducer in core (moved to clients)

 commit 475423bb642ac8f816e8080f891867a6362c17fa
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-01T04:05:09Z

 Convert class to string before adding to json object.

 commit 0a5de8e0590e3a8dce1a91769ad41497b5e07d17
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-02T22:46:52Z

 Fixed checkstyle errors. Changed name to VerifiableProducer. Added
 synchronization for thread safety on println statements.

 commit 9100417ce0717a71c822c5a279fe7858bfe7a7ee
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-03T19:50:11Z

 Updated command-line options for VerifiableProducer. Extracted
 throughput logic to make it reusable.

 commit 1228eefc4e52b58c214b3ad45feab36a475d5a66
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-04T01:09:14Z

 Renamed throttler

 commit 6842ed1ffad62a84df67a0f0b6a651a6df085d12
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-04T01:12:11Z

 left out a file from last commit

 commit d586fb0eb63409807c02f280fae786cec55fb348
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-04T01:22:34Z

 Updated comments to reflect that throttler is not message-specific

 commit a80a4282ba9a288edba7cdf409d31f01ebf3d458
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-04T20:47:21Z

 Added shell program for VerifiableProducer.

 commit 51a94fd6ece926bcdd864af353efcf4c4d1b8ad8
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-04T20:55:02Z

 Use argparse4j instead of joptsimple. ThroughputThrottler now has more
 intuitive behavior when targetThroughput is 0.

 commit 632be12d2384bfd1ed3b057913dfd363cab71726
 Author: Geoff grand...@gmail.com
 Date:   2015-06-04T22:22:44Z

 Merge pull request #3 from confluentinc/verbose-client

 Verbose client

 commit fc7c81c1f6cce497c19da34f7c452ee44800ab6d
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-11T01:01:39Z

 added setup.py

 commit 884b20e3a7ce7a94f22594782322e4366b51f7eb
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-11T01:02:11Z

 Moved a bunch of files to kafkatest directory

 commit 25a413d6ae938e9773eb2b20509760bab464
 Author: Geoff grand...@gmail.com
 Date:   2015-06-11T20:29:21Z

 Update aws-example-Vagrantfile.local

 commit 96533c3718a9285d78393fb453b951592c72a490

Re: Review Request 35655: Patch for KAFKA-2271

2015-06-22 Thread Ewen Cheslack-Postava

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35655/#review88877
---

Ship it!


LGTM. There were 9 question marks when 10 characters were requested, so the 
problem was probably just that a whitespace character at the start or end would 
get trimmed during AbstractConfig's parsing.

- Ewen Cheslack-Postava


On June 19, 2015, 4:48 p.m., Jason Gustafson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/35655/
 ---
 
 (Updated June 19, 2015, 4:48 p.m.)
 
 
 Review request for kafka.
 
 
 Bugs: KAFKA-2271
 https://issues.apache.org/jira/browse/KAFKA-2271
 
 
 Repository: kafka
 
 
 Description
 ---
 
 KAFKA-2271; fix minor test bugs
 
 
 Diffs
 -
 
   core/src/test/scala/unit/kafka/server/KafkaConfigConfigDefTest.scala 
 98a5b042a710d3c1064b0379db1d152efc9eabee 
 
 Diff: https://reviews.apache.org/r/35655/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Jason Gustafson
 




Re: [DISCUSS] KIP-26 - Add Copycat, a connector framework for data import/export

2015-06-22 Thread Ewen Cheslack-Postava
I'll respond to specific comments, but at the bottom of this email I've
included some comparisons with other connector frameworks and Kafka
import/export tools. This definitely isn't an exhaustive list, but
hopefully will clarify how I'm thinking about where Copycat should live wrt these
other systems.

Since Jay replied with 2 essays as I was writing this up, there may be some
duplication. Sorry for the verbosity...

@Roshan - The main gist is that by designing a framework around Kafka, we
don't have to generalize in a way that loses important features. Of the
systems you mentioned, the ones that are fairly general and have lots of
connectors don't offer the parallelism or semantics that could be achieved
(e.g. Flume) and the ones that have these benefits are almost all highly
specific to just one or two systems (e.g. Camus). Since Kafka is
increasingly becoming a central hub for streaming data (and buffer for
batch systems), one *common* system for integrating all these pieces is
pretty compelling.
Import: Flume is just one of many similar systems designed around log
collection. See notes below, but one major point is that they generally
don't provide any sort of guaranteed delivery semantics.
Export: Same deal here, you either get good delivery semantics and
parallelism for one system or a lot of connectors with very limited
guarantees. Copycat is intended to make it very easy to write connectors
for a variety of systems, get good (configurable!) delivery semantics,
parallelism, and work for a wide variety of systems (e.g. both batch and
streaming).
YARN: My point isn't that YARN is bad, it's that tying to any particular
cluster manager severely limits the applicability of the tool. The goal is
to make Copycat agnostic to the cluster manager so it can run under Mesos,
YARN, etc.
Exactly once: You accomplish this in any system by managing offsets in the
destination system atomically with the data or through some kind of
deduplication. Jiangjie actually just gave a great talk about this issue at
a recent Kafka meetup, perhaps he can share some slides about it. When you
see all the details involved, you'll see why I think it might be nice to
have the framework help you manage the complexities of achieving different
delivery semantics ;)
Connector variety: Addressed above.
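
For the exactly-once point above, a minimal JDBC-flavored sketch of the "offsets stored atomically with the data" approach (table and column names are made up for illustration):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// The record and the source offset commit in one transaction, so after a
// crash a replay can skip anything at or below the stored offset.
public final class AtomicOffsetWriter {
    public void write(Connection conn, String payload, long offset) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement data = conn.prepareStatement(
                 "INSERT INTO events (payload) VALUES (?)");
             PreparedStatement off = conn.prepareStatement(
                 "UPDATE kafka_offsets SET last_offset = ? WHERE topic_partition = ?")) {
            data.setString(1, payload);
            data.executeUpdate();
            off.setLong(1, offset);
            off.setString(2, "events-0");
            off.executeUpdate();
            conn.commit(); // both writes land, or neither does
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```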

@Jiangjie -
1. Yes, the expectation is that most coding is in the connectors. Ideally
the framework doesn't need many changes after we get the basics up and
running. But I'm not sure I understand what you mean about a library vs.
static framework?
2. This depends on packaging. We should at least have a separate jar, just
as we now do with clients. It's true that the tar.gz downloads would
contain both, but that probably makes sense since you need Kafka to do any
local testing with Copycat anyway, which you presumably want to do before
running any production jobs.

@Gwen -
I agree that the community around a project is really important. Some of
the issues you mentioned -- committership and dependencies -- are
definitely important considerations. The community aspect can easily make
or break something like Copycat. I think this is something Kafka needs to
address anyway (committership in particular, since committers are currently
overloaded).

One immediate benefit of including it in the same community is that it
starts out with a great, supportive community. We'd get to leverage all the
great existing Kafka knowledge of the community. It also means Copycat
patches are more likely to be seen by Kafka devs that can give helpful
reviews. I'll definitely agree that there are some drawbacks too -- joining
the mailing lists might be a bit overwhelming if you only wanted help w/
Copycat :)

Another benefit, not to be overlooked, is that it avoids a bunch of extra
overhead. Incubating an entire separate Apache project adds a *lot* of
overhead.

I also want to mention that the KIP specifically mentions that Copycat
should use public Kafka APIs, but I don't think this means development of
both should be decoupled. In particular, the distributed version of Copycat
needs functionality that is very closely related to functionality that
already exists in Kafka, some of which is exposed via public protocols
(worker membership needs to be tracked like consumers, worker assignments
have similar needs to consumer topic-partition assignments, offset commits
in Copycat are similar to consumer offset commits). It's hard to say if any
of that can be directly reused, but if it could, it could pay off in
spades. Even if not, since there are so many similar issues involved, it'd
be worth it just to leverage previous experience. Even though Copycat
should be cleanly separated from the main Kafka code (just as the clients
are now cleanly separated from the broker), I think they can likely benefit
from careful co-evolution that is more difficult to achieve if they really
are separate communities.

On docs, you're right that we could address that issue just by adding a 

Re: [DISCUSS] KIP-26 - Add Copycat, a connector framework for data import/export

2015-06-22 Thread Jay Kreps
Hey Gwen,

That makes a lot of sense. Here was the thinking on our side.

I guess there are two questions, where does Copycat go and where do the
connectors go?

I'm in favor of Copycat being in Kafka and the connectors being federated.

Arguments for federating connectors:
- There will be like > 100 connectors so if we keep them all in the same
repo it will be a lot.
- These plugin apis are a fantastic area for open source contribution--well
defined, bite sized, immediately useful, etc.
- If I wrote connector A I'm not particularly qualified to review connector
B. These things require basic Kafka knowledge but mostly they're very
system specific. Putting them all in one project ends up being kind of a
mess.
- Many people will have in-house systems that require custom connectors
anyway.
- You can't centrally maintain all the connectors, so you in any case need
to solve the whole "app store" experience for connectors (Ewen laughs
at me every time I say "app store" for connectors). Once you do that it
makes sense to just use the mechanism for everything.
- Many vendors we've talked to want to be able to maintain their own
connector and release it with their system not release it with Kafka or
another third party project.
- There is definitely a role for testing and certification of the
connectors but it's probably not something the main project should take on.

Federation doesn't necessarily mean that there can only be one repository
for each connector. We have a single repo for the connectors we're building
at confluent just for simplicity. It just means that regardless of where
the connector is maintained it integrates as a first-class citizen.

Basically I think really nailing federated connectors is pretty central to
having a healthy connector ecosystem which is the primary thing for making
this work.

Okay now the question of whether the copycat apis/framework should be in
Kafka or be an external project. We debated this a lot internally.

I was on the pro-kafka-inclusion side so let me give that argument. I think
the apis for pulling data into Kafka or pushing into a third party system
are actually really a core thing to what Kafka is. Kafka currently provides
a push producer and pull consumer because those are the harder problems to
solve, but about half the time you need the opposite (a pull producer and
push consumer). It feels weird to include any new thing, but I actually
feel like these apis are super central and natural to include in Kafka (in
fact they are so natural many other systems only have that style of API).
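
To make the terminology concrete, here are purely hypothetical interfaces (invented for this sketch, not part of any Kafka API): a "pull producer" is polled by the framework for records to publish, and a "push consumer" is handed records as they arrive.

```java
import java.util.List;

// Hypothetical illustration only; these interfaces are not Kafka APIs.
interface PullProducer<T> {
    List<T> poll(); // framework pulls the next batch from the source
}

interface PushConsumer<T> {
    void accept(List<T> records); // framework pushes each batch to the sink
}
```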

I think the key question is whether we can do a good job at designing these
apis. If we can then we should really have an official set of apis. Having
official Kafka apis that are documented as part of the main docs and are
part of each release will do a ton to help foster the connector ecosystem
because it will be kind of a default way of doing Kaka integration and all
the people building in-house from-scratch connectors will likely just use
it. If it is a separate project then it is a separate discovery and
adoption decision (this is somewhat irrational but totally true).

I think one assumption we are making is that the copycat framework won't be
huge. It should be a manageable chunk of code.

I agree with your description of the some of the cons of bundling. However
I think there are pros as well and some of them are quite important.

The biggest is that for some reasons things that are maintained and
documented together end up feeling and working like a single product. This
is sort of a fuzzy thing. But one complaint I have about the Hadoop
ecosystem (and it is one of the more amazing products of open source in the
history of the world, so forgive the criticism) is that it FEELs like a
loosely affiliated collection of independent things kind of bolted
together. Products that are more centralized can give a much more holistic
feel to usage (configuration, commands, monitoring, etc) and things that
aren't somehow always drift apart (maybe just because the committers are
different).

So I actually totally agree with what you said about Spark. And if we end
up trying to include a machine learning library or anything far afield I
think I would agree we would have exactly that problem.

But I think the argument I would make is that this is actually a gap in our
existing product, not a new product and so having that identity is
important.

-Jay

On Sun, Jun 21, 2015 at 9:24 PM, Gwen Shapira gshap...@cloudera.com wrote:

 Ah, I see this in rejected alternatives now. Sorry :)

 I actually prefer the idea of a separate project for framework +
 connectors over having the framework be part of Apache Kafka.

 Looking at nearby examples: Hadoop has created a wide ecosystem of
 projects, with Sqoop and Flume supplying connectors. Spark on the
 other hand keeps its subprojects as part of Apache Spark.

 When I look at both projects, I see that Flume and Sqoop created
 active communities (that 

[jira] [Commented] (KAFKA-2205) Generalize TopicConfigManager to handle multiple entity configs

2015-06-22 Thread Aditya Auradkar (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596965#comment-14596965
 ] 

Aditya Auradkar commented on KAFKA-2205:


[~junrao] - Can you review this patch?

 Generalize TopicConfigManager to handle multiple entity configs
 ---

 Key: KAFKA-2205
 URL: https://issues.apache.org/jira/browse/KAFKA-2205
 Project: Kafka
  Issue Type: Sub-task
Reporter: Aditya Auradkar
Assignee: Aditya Auradkar
  Labels: quotas
 Attachments: KAFKA-2205.patch


 Acceptance Criteria:
 - TopicConfigManager should be generalized to handle Topic and Client configs 
 (and any type of config in the future). As described in KIP-21
 - Add a ConfigCommand tool to change topic and client configuration



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-26 - Add Copycat, a connector framework for data import/export

2015-06-22 Thread Jay Kreps
Hey Roshan,

That is definitely the key question in this space--what can we do that
other systems don't?

It's true that there are a number of systems that copy data between things.
At a high enough level of abstraction I suppose they are somewhat the same.
But I think this area is the source of rather a lot of pain for people
running these things so it is hard to imagine that the problem is totally
solved in the current state.

All the systems you mention are good, and a few we have even contributed to
so this is not to disparage anything.

Here are the advantages in what we are proposing:
1. Unlike sqoop and Camus this treats batch load as a special case of
continuous load (where the stream happens to be a bit bursty). I think this
is the right approach and enables real-time integration without giving up
the possibility of periodic dumps.
2.  We are trying to make it possible to capture and integrate the metadata
around schema with the data whenever possible. This is present and
something the connectors themselves have access to. I think this is a big
deal versus just delivering opaque byte[]/String rows, and is really
required for doing this kind of thing well at scale. This allows a lot of
simple filtering, projection, mapping, etc without custom code as well as
making it possible to start to have notions of compatibility and schema
evolution. We hope to make the byte[]/String case be kind of a special case
of the richer record model where you just have a simple schema.
3. This has a built in notion of parallelism throughout.
4. This maps well to Kafka. For people using Kafka I think basically
sharing a data model makes things a lot simpler (topics, partitions, etc).
This also makes it a lot easier to reason about guarantees.
5. Philosophically we are very committed to the idea of O(1) data loads,
which I think Gwen has more eloquently called the "factory model", and in
other contexts I have heard described as "cattle, not pets". The idea being
that if you accept up front that you are going to have ~1000 data streams
in a company and dozens of sources and syncs the approach you take towards
this sort of stuff is radically different than if you assume a few inputs,
one output and a dozen data streams. I think this plays out in a bunch of
ways around management, configuration, etc.

Ultimately I think one thing we learned in thinking about the area is that
the system you come up with really comes down to what assumptions you make.

To address a few of your other points:
- We agree running in YARN is a good thing, but requiring YARN is a bad
thing. I think you may be seeing things somewhat from a Hadoop-centric view
where YARN is much more prevalent. However I think the scope of the problem
is not at all specific to Hadoop and beyond the Hadoop ecosystem we don't
see that heavy use of YARN (Mesos is more prevalent, but neither is
particularly common). I think our approach here is that copycat runs as a
process, if you run it in YARN it should work in Slider, if you run it in
Mesos in Marathon, and if you run it with old fashioned ops tools then you
just manage it like any other process.
- Exactly-once: Yes, but when we add that support in Kafka you will get it
end-to-end, which is important.
- I agree that all existing systems have more connectors--we are willing to
do the work to catch up there as we think it is possible to get to an
overall better state. I definitely agree this is significant work.

-Jay




On Fri, Jun 19, 2015 at 7:57 PM, Roshan Naik ros...@hortonworks.com wrote:

 My initial thoughts:

 Although it is kind of discussed very broadly, I did struggle a bit to
 properly grasp the value add this adds over the alternative approaches that
 are available today (or need a little work to accomplish) in specific use
 cases. I feel its better to take  specific common use cases and show why
 this will do better to make it clear. For example data flow starting from a
 pool of web server and finally end up in HDFS or Hive while providing
 At-least one guarantees.

 Below are more specific points that occurred to me:

 - Import: Today we can create data flows to pick up data from a variety of
 source and push data into Kafka using Flume. Not clear how this system can
 do better in this specific case.
 - Export: For pulling data out of Kafka there is Camus (which limits the
 destination to HDFS), Flume (which can deliver to many places) and also
 Sqoop (which could be extended to support Kafka). Camus and Sqoop don't
 have the "requires defining many tasks" issue for parallelism.
 - YARN support – Letting YARN manage things is actually a good thing (not a
 bad thing as indicated), since it's easier to scale in/out as needed
 without worrying too much about hardware allocation.
 - Exactly-Once: It is clear that on the import side you won't support
 that for now. Not clear how you will support that on the export side for
 destinations like HDFS or some other. Exactly once only makes sense when we
 can 

Re: Review Request 34789: Patch for KAFKA-2168

2015-06-22 Thread Jason Gustafson

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34789/
---

(Updated June 22, 2015, 11:35 p.m.)


Review request for kafka.


Bugs: KAFKA-2168
https://issues.apache.org/jira/browse/KAFKA-2168


Repository: kafka


Description (updated)
---

KAFKA-2168; refactored callback handling to prevent unnecessary requests


KAFKA-2168; address review comments


KAFKA-2168; fix rebase error and checkstyle issue


KAFKA-2168; address review comments and add docs


KAFKA-2168; handle polling with timeout 0


KAFKA-2168; timeout=0 means return immediately


KAFKA-2168; address review comments


KAFKA-2168; address more review comments


KAFKA-2168; updated for review comments


KAFKA-2168; add serialVersionUID to ConsumerWakeupException


Diffs (updated)
-

  clients/src/main/java/org/apache/kafka/clients/consumer/Consumer.java 
8f587bc0705b65b3ef37c86e0c25bb43ab8803de 
  clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerRecords.java 
1ca75f83d3667f7d01da1ae2fd9488fb79562364 
  
clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerWakeupException.java
 PRE-CREATION 
  clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java 
951c34c92710fc4b38d656e99d2a41255c60aeb7 
  clients/src/main/java/org/apache/kafka/clients/consumer/MockConsumer.java 
f50da825756938c193d7f07bee953e000e2627d9 
  
clients/src/main/java/org/apache/kafka/clients/consumer/OffsetResetStrategy.java
 PRE-CREATION 
  
clients/src/main/java/org/apache/kafka/clients/consumer/internals/Coordinator.java
 41cb9458f51875ac9418fce52f264b35adba92f4 
  
clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java 
56281ee15cc33dfc96ff64d5b1e596047c7132a4 
  
clients/src/main/java/org/apache/kafka/clients/consumer/internals/Heartbeat.java
 e7cfaaad296fa6e325026a5eee1aaf9b9c0fe1fe 
  
clients/src/main/java/org/apache/kafka/clients/consumer/internals/RequestFuture.java
 PRE-CREATION 
  
clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java
 cee75410127dd1b86c1156563003216d93a086b3 
  clients/src/main/java/org/apache/kafka/common/utils/Utils.java 
f73eedb030987f018d8446bb1dcd98d19fa97331 
  clients/src/test/java/org/apache/kafka/clients/consumer/MockConsumerTest.java 
677edd385f35d4262342b567262c0b874876d25b 
  
clients/src/test/java/org/apache/kafka/clients/consumer/internals/CoordinatorTest.java
 1454ab73df22cce028f41f74b970628829da4e9d 
  
clients/src/test/java/org/apache/kafka/clients/consumer/internals/FetcherTest.java
 419541011d652becf0cda7a5e62ce813cddb1732 
  
clients/src/test/java/org/apache/kafka/clients/consumer/internals/HeartbeatTest.java
 ecc78cedf59a994fcf084fa7a458fe9ed5386b00 
  
clients/src/test/java/org/apache/kafka/clients/consumer/internals/SubscriptionStateTest.java
 e000cf8e10ebfacd6c9ee68d7b88ff8c157f73c6 
  clients/src/test/java/org/apache/kafka/common/utils/UtilsTest.java 
2ebe3c21f611dc133a2dbb8c7dfb0845f8c21498 

Diff: https://reviews.apache.org/r/34789/diff/


Testing
---


Thanks,

Jason Gustafson



[jira] [Commented] (KAFKA-2168) New consumer poll() can block other calls like position(), commit(), and close() indefinitely

2015-06-22 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596850#comment-14596850
 ] 

Jason Gustafson commented on KAFKA-2168:


Updated reviewboard https://reviews.apache.org/r/34789/diff/
 against branch upstream/trunk

 New consumer poll() can block other calls like position(), commit(), and 
 close() indefinitely
 -

 Key: KAFKA-2168
 URL: https://issues.apache.org/jira/browse/KAFKA-2168
 Project: Kafka
  Issue Type: Sub-task
  Components: clients, consumer
Reporter: Ewen Cheslack-Postava
Assignee: Jason Gustafson
Priority: Critical
 Fix For: 0.8.3

 Attachments: KAFKA-2168.patch, KAFKA-2168_2015-06-01_16:03:38.patch, 
 KAFKA-2168_2015-06-02_17:09:37.patch, KAFKA-2168_2015-06-03_18:20:23.patch, 
 KAFKA-2168_2015-06-03_21:06:42.patch, KAFKA-2168_2015-06-04_14:36:04.patch, 
 KAFKA-2168_2015-06-05_12:01:28.patch, KAFKA-2168_2015-06-05_12:44:48.patch, 
 KAFKA-2168_2015-06-11_14:09:59.patch, KAFKA-2168_2015-06-18_14:39:36.patch, 
 KAFKA-2168_2015-06-19_09:19:02.patch, KAFKA-2168_2015-06-22_16:34:37.patch


 The new consumer is currently using very coarse-grained synchronization. For 
 most methods this isn't a problem since they finish quickly once the lock is 
 acquired, but poll() might run for a long time (and commonly will since 
 polling with long timeouts is a normal use case). This means any operations 
 invoked from another thread may block until the poll() call completes.
 Some example use cases where this can be a problem:
 * A shutdown hook is registered to trigger shutdown and invokes close(). It 
 gets invoked from another thread and blocks indefinitely.
 * User wants to manage offset commits themselves in a background thread. If 
 the commit policy is not purely time based, it's not currently possible to 
 make sure the call to commit() will be processed promptly.
 Two possible solutions to this:
 1. Make sure a lock is not held during the actual select call. Since we have 
 multiple layers (KafkaConsumer -> NetworkClient -> Selector -> nio Selector) 
 this is probably hard to make work cleanly since locking is currently only 
 performed at the KafkaConsumer level and we'd want it unlocked around a 
 single line of code in Selector.
 2. Wake up the selector before synchronizing for certain operations. This 
 would require some additional coordination to make sure the caller of 
 wakeup() is able to acquire the lock promptly (instead of, e.g., the poll() 
 thread being woken up and then promptly reacquiring the lock with a 
 subsequent long poll() call).
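
A rough sketch of option 2 (illustrative names, not the actual KafkaConsumer internals): waking the selector makes a blocked select() return, so the poll() thread releases the lock quickly and the other caller can acquire it.

```java
import java.io.IOException;
import java.nio.channels.Selector;

// Sketch of "wake up the selector before synchronizing". The remaining
// coordination problem described above, making sure the waker gets the
// lock before poll() loops around and grabs it again, is not solved here.
public class WakeupSketch {
    private final Object lock = new Object();
    private final Selector selector;

    public WakeupSketch() throws IOException {
        selector = Selector.open();
    }

    public void poll(long timeoutMs) throws IOException {
        synchronized (lock) {
            selector.select(timeoutMs); // may block for a long time
        }
    }

    public void commit() {
        selector.wakeup(); // forces an in-flight select() to return
        synchronized (lock) {
            // ... send the offset commit ...
        }
    }
}
```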



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2168) New consumer poll() can block other calls like position(), commit(), and close() indefinitely

2015-06-22 Thread Jason Gustafson (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-2168:
---
Attachment: KAFKA-2168_2015-06-22_16:34:37.patch

 New consumer poll() can block other calls like position(), commit(), and 
 close() indefinitely
 -

 Key: KAFKA-2168
 URL: https://issues.apache.org/jira/browse/KAFKA-2168
 Project: Kafka
  Issue Type: Sub-task
  Components: clients, consumer
Reporter: Ewen Cheslack-Postava
Assignee: Jason Gustafson
Priority: Critical
 Fix For: 0.8.3

 Attachments: KAFKA-2168.patch, KAFKA-2168_2015-06-01_16:03:38.patch, 
 KAFKA-2168_2015-06-02_17:09:37.patch, KAFKA-2168_2015-06-03_18:20:23.patch, 
 KAFKA-2168_2015-06-03_21:06:42.patch, KAFKA-2168_2015-06-04_14:36:04.patch, 
 KAFKA-2168_2015-06-05_12:01:28.patch, KAFKA-2168_2015-06-05_12:44:48.patch, 
 KAFKA-2168_2015-06-11_14:09:59.patch, KAFKA-2168_2015-06-18_14:39:36.patch, 
 KAFKA-2168_2015-06-19_09:19:02.patch, KAFKA-2168_2015-06-22_16:34:37.patch


 The new consumer is currently using very coarse-grained synchronization. For 
 most methods this isn't a problem since they finish quickly once the lock is 
 acquired, but poll() might run for a long time (and commonly will since 
 polling with long timeouts is a normal use case). This means any operations 
 invoked from another thread may block until the poll() call completes.
 Some example use cases where this can be a problem:
 * A shutdown hook is registered to trigger shutdown and invokes close(). It 
 gets invoked from another thread and blocks indefinitely.
 * User wants to manage offset commits themselves in a background thread. If 
 the commit policy is not purely time based, it's not currently possible to 
 make sure the call to commit() will be processed promptly.
 Two possible solutions to this:
 1. Make sure a lock is not held during the actual select call. Since we have 
 multiple layers (KafkaConsumer -> NetworkClient -> Selector -> nio Selector) 
 this is probably hard to make work cleanly since locking is currently only 
 performed at the KafkaConsumer level and we'd want it unlocked around a 
 single line of code in Selector.
 2. Wake up the selector before synchronizing for certain operations. This 
 would require some additional coordination to make sure the caller of 
 wakeup() is able to acquire the lock promptly (instead of, e.g., the poll() 
 thread being woken up and then promptly reacquiring the lock with a 
 subsequent long poll() call).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 35734: Patch for KAFKA-2293

2015-06-22 Thread Joel Koshy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35734/#review88866
---

Ship it!


Minor comment: would use `%s` with `TopicAndPartition` instead of `%s,%d`. I'll 
do this on check-in.

- Joel Koshy


On June 22, 2015, 5:35 p.m., Aditya Auradkar wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/35734/
 ---
 
 (Updated June 22, 2015, 5:35 p.m.)
 
 
 Review request for kafka.
 
 
 Bugs: KAFKA-2293
 https://issues.apache.org/jira/browse/KAFKA-2293
 
 
 Repository: kafka
 
 
 Description
 ---
 
 Fix for 2293
 
 
 Diffs
 -
 
   core/src/main/scala/kafka/cluster/Partition.scala 
 6cb647711191aee8d36e9ff15bdc2af4f1c95457 
 
 Diff: https://reviews.apache.org/r/35734/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Aditya Auradkar
 




Re: Review Request 35655: Patch for KAFKA-2271

2015-06-22 Thread Jason Gustafson


 On June 21, 2015, 5:49 a.m., Guozhang Wang wrote:
  I think the main reason is that util.Random.nextString() may include other 
  non-ascii chars and hence its toString may not be well-defined:
  
  http://alvinalexander.com/scala/creating-random-strings-in-scala
  
  So I think bottom-line is that we should not use util.Random.nextString to 
  generate random strings. There are a couple of other places where it is 
  used and I suggest we remove them as well.

Since Java uses utf-16 internally, nextString is virtually guaranteed to be 
non-ascii. But the problem was actually the additional space which was being 
trimmed. Nevertheless, we might want to avoid using non-ascii characters in 
assertions unless we can find a reasonable way to display them in test results. 
The other usages seem legitimate though. One of them explicitly expects 
non-ascii strings (ApiUtilsTest.testShortStringNonASCII), and the others just 
use them as arbitrary data of a certain length 
(OffsetCommitTest.testLargeMetadataPayload).


- Jason


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35655/#review88690
---


On June 19, 2015, 4:48 p.m., Jason Gustafson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/35655/
 ---
 
 (Updated June 19, 2015, 4:48 p.m.)
 
 
 Review request for kafka.
 
 
 Bugs: KAFKA-2271
 https://issues.apache.org/jira/browse/KAFKA-2271
 
 
 Repository: kafka
 
 
 Description
 ---
 
 KAFKA-2271; fix minor test bugs
 
 
 Diffs
 -
 
   core/src/test/scala/unit/kafka/server/KafkaConfigConfigDefTest.scala 
 98a5b042a710d3c1064b0379db1d152efc9eabee 
 
 Diff: https://reviews.apache.org/r/35655/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Jason Gustafson
 




Re: Review Request 35655: Patch for KAFKA-2271

2015-06-22 Thread Jason Gustafson


 On June 22, 2015, 11:43 p.m., Ewen Cheslack-Postava wrote:
  LGTM. There were 9 question marks when 10 characters were requested, so the 
  problem was probably just that a whitespace character at the start or end 
  would get trimmed during AbstractConfig's parsing.

That's what I thought as well, but I was puzzled that I couldn't reproduce it. 
In fact, it looks like the issue was fixed with KAFKA-2249, which preserves the 
original properties that were used to construct the config. In that case, 
however, the assertion basically becomes a tautology, so perhaps we should just 
remove the test case?


- Jason


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35655/#review88877
---


On June 19, 2015, 4:48 p.m., Jason Gustafson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/35655/
 ---
 
 (Updated June 19, 2015, 4:48 p.m.)
 
 
 Review request for kafka.
 
 
 Bugs: KAFKA-2271
 https://issues.apache.org/jira/browse/KAFKA-2271
 
 
 Repository: kafka
 
 
 Description
 ---
 
 KAFKA-2271; fix minor test bugs
 
 
 Diffs
 -
 
   core/src/test/scala/unit/kafka/server/KafkaConfigConfigDefTest.scala 
 98a5b042a710d3c1064b0379db1d152efc9eabee 
 
 Diff: https://reviews.apache.org/r/35655/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Jason Gustafson
 




[jira] [Commented] (KAFKA-1646) Improve consumer read performance for Windows

2015-06-22 Thread Honghai Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595450#comment-14595450
 ] 

Honghai Chen commented on KAFKA-1646:
-

See the commit 
https://git-wip-us.apache.org/repos/asf?p=kafka.git;a=commit;h=ca758252c5a524fe6135a585282dd4bf747afef2
 
Many thanks for everyone for your help to make this happen.

 Improve consumer read performance for Windows
 -

 Key: KAFKA-1646
 URL: https://issues.apache.org/jira/browse/KAFKA-1646
 Project: Kafka
  Issue Type: Improvement
  Components: log
Affects Versions: 0.8.1.1
 Environment: Windows
Reporter: xueqiang wang
Assignee: Honghai Chen
  Labels: newbie, patch
 Fix For: 0.8.3

 Attachments: Improve consumer read performance for Windows.patch, 
 KAFKA-1646-truncate-off-trailing-zeros-on-broker-restart-if-bro.patch, 
 KAFKA-1646_20141216_163008.patch, KAFKA-1646_20150306_005526.patch, 
 KAFKA-1646_20150511_AddTestcases.patch, 
 KAFKA-1646_20150609_MergeToLatestTrunk.patch, 
 KAFKA-1646_20150616_FixFormat.patch, KAFKA-1646_20150618_235231.patch


 This patch is for the Windows platform only. On Windows, if there is 
 more than one replica writing to disk, the segment log files will not be 
 consistent on disk and consumer read performance drops greatly. This fix 
 allocates more disk space when rolling a new segment, which improves 
 consumer read performance on the NTFS file system.
 This patch doesn't affect file allocation on other filesystems, for it only 
 adds statements like 'if(Os.iswindow)' or adds methods used on Windows.
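
The mechanism, sketched with plain java.io (the actual patch is in Kafka's Scala log layer and gated by an OS check): growing the file to its full segment size up front lets NTFS reserve contiguous space instead of fragmenting it as appends from multiple logs interleave.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of segment preallocation; the file name and size are illustrative.
public class PreallocateDemo {
    public static void main(String[] args) throws IOException {
        File segment = new File("00000000000000000000.log");
        try (RandomAccessFile raf = new RandomAccessFile(segment, "rw")) {
            raf.setLength(1024L * 1024L * 1024L); // reserve the full segment size up front
        }
    }
}
```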



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2235) LogCleaner offset map overflow

2015-06-22 Thread Ivan Simoneko (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Simoneko updated KAFKA-2235:
-
Attachment: KAFKA-2235_v2.patch

 LogCleaner offset map overflow
 --

 Key: KAFKA-2235
 URL: https://issues.apache.org/jira/browse/KAFKA-2235
 Project: Kafka
  Issue Type: Bug
  Components: core, log
Affects Versions: 0.8.1, 0.8.2.0
Reporter: Ivan Simoneko
Assignee: Jay Kreps
 Attachments: KAFKA-2235_v1.patch, KAFKA-2235_v2.patch


 We've seen log cleaning generating an error for a topic with lots of small 
 messages. It seems that cleanup map overflow is possible if a log segment 
 contains more unique keys than empty slots in offsetMap. Checking baseOffset 
 and map utilization before processing a segment is not enough because it 
 doesn't take the segment size (the number of unique messages in the 
 segment) into account.
 I suggest estimating the upper bound of keys in a segment as the number of 
 messages in the segment and comparing it with the number of available slots in 
 the map (keeping in mind the desired load factor). This works in cases where 
 an empty map is capable of holding all the keys for a single segment. If even a 
 single segment is not able to fit into an empty map, the cleanup process will 
 still fail. Probably there should be a limit on the log segment entries count?
 Here is the stack trace for this error:
 2015-05-19 16:52:48,758 ERROR [kafka-log-cleaner-thread-0] 
 kafka.log.LogCleaner - [kafka-log-cleaner-thread-0], Error due to
 java.lang.IllegalArgumentException: requirement failed: Attempt to add a new 
 entry to a full offset map.
at scala.Predef$.require(Predef.scala:233)
at kafka.log.SkimpyOffsetMap.put(OffsetMap.scala:79)
at 
 kafka.log.Cleaner$$anonfun$kafka$log$Cleaner$$buildOffsetMapForSegment$1.apply(LogCleaner.scala:543)
at 
 kafka.log.Cleaner$$anonfun$kafka$log$Cleaner$$buildOffsetMapForSegment$1.apply(LogCleaner.scala:538)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at kafka.utils.IteratorTemplate.foreach(IteratorTemplate.scala:32)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at kafka.message.MessageSet.foreach(MessageSet.scala:67)
at 
 kafka.log.Cleaner.kafka$log$Cleaner$$buildOffsetMapForSegment(LogCleaner.scala:538)
at 
 kafka.log.Cleaner$$anonfun$buildOffsetMap$3.apply(LogCleaner.scala:515)
at 
 kafka.log.Cleaner$$anonfun$buildOffsetMap$3.apply(LogCleaner.scala:512)
at scala.collection.immutable.Stream.foreach(Stream.scala:547)
at kafka.log.Cleaner.buildOffsetMap(LogCleaner.scala:512)
at kafka.log.Cleaner.clean(LogCleaner.scala:307)
at 
 kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:221)
at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:199)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-26 - Add Copycat, a connector framework for data import/export

2015-06-22 Thread Jiangjie Qin
Very useful KIP.
I have no clear opinion yet on where it would be better to put the framework.
I agree with Gwen on the benefits we can get from having a separate project
for Copycat. But still have a few questions:

1. As far as code is concerned, Copycat would be some datasource adapters
+ Kafka clients. My guess is that for most people who want to contribute to
Copycat, the code would be on the data source adapter part, while the Kafka
clients part will rarely be touched. The framework itself probably only
needs to change when changes are made to Kafka. If that is the case, it
seems cleaner to make connectors a separate library project instead of
having a static framework along with it?

2. I am not sure whether it matters or not. Say I'm a user who only
wants to use Copycat while the Kafka cluster is maintained by someone else. If
we package Copycat with Kafka, I have to get all of Kafka even if I
only want Copycat. Is that necessary if we want to guarantee compatibility
between Copycat and Kafka?

That said, I kind of think the packaging should depend on:
how tightly coupled Kafka and Copycat are versus Connectors and Copycat, and
how easy the result is for users to adopt.

Thanks,

Jiangjie (Becket) Qin

On 6/21/15, 9:24 PM, Gwen Shapira gshap...@cloudera.com wrote:

Ah, I see this in rejected alternatives now. Sorry :)

I actually prefer the idea of a separate project for framework +
connectors over having the framework be part of Apache Kafka.

Looking at nearby examples: Hadoop has created a wide ecosystem of
projects, with Sqoop and Flume supplying connectors. Spark on the
other hand keeps its subprojects as part of Apache Spark.

When I look at both projects, I see that Flume and Sqoop created
active communities (that was especially true a few years back when we
were rapidly growing), with many companies contributing. Spark, OTOH
(and with all respect to my friends at Spark), has tons of
contributors to its core but much less activity on its sub-projects
(for example, SparkStreaming). I strongly believe that SparkStreaming
is under-served by being a part of Spark, especially when compared to
Storm which is an independent project with its own community.

The way I see it, connector frameworks are significantly simpler than
distributed data stores (although they are pretty large in terms of
code base, especially with copycat having its own distributed
processing framework). Which means that the barrier to contribution to
connector frameworks is lower, both for contributing to the framework
and for contributing connectors. Separate communities can also have
different rules regarding dependencies and committership.
Committership is the big one, and IMO what prevents SparkStreaming
from growing - I can give someone commit bit on Sqoop without giving
them any power over Hadoop. Not true for Spark and SparkStreaming.
This means that a CopyCat community (with its own sexy cat logo) will
be able to attract more volunteers and grow at a faster pace than core
Kafka, making it more useful to the community.

The other part is that just like Kafka will be more useful with a
connector framework, a connector framework tends to work better when
there are lots of connectors. So if we decide to partition the Kafka /
Connector framework / Connectors triad, I'm not sure which
partitioning makes more sense. Giving CopyCat (I love the name. You
can say things like "get the data into MySQL and CC Kafka") its own
community will allow the CopyCat community to accept connector
contributions, which is good for CopyCat and for Kafka adoption.
Oracle and Netezza contributed connectors to Sqoop; they probably
couldn't have contributed them at all if Sqoop were inside Hadoop, and they
can't really open-source their own stuff through GitHub, so it was a
win for our community.
connectors for CopyCat and not contribute them to the community (like
the popular Teradata connector for Sqoop).

Regarding ease of use and adoption: Right now, a lot of people adopt
Kafka as a stand-alone piece, while Hadoop usually shows up through a
distribution. I expect that soon people will start adopting Kafka
through distributions, so the framework and a collection of connectors
will be part of every distribution. In the same way that no one thinks
of Sqoop or Flume as stand alone projects. With a bunch of Kafka
distributions out there, people will get Kafka + Framework +
Connectors, with a core connection portion being common to multiple
distributions - this will allow even easier adoption, while allowing
the Kafka community to focus on core Kafka.

The point about documentation that Ewen has made in the KIP is a good
one. We definitely want to point people to the right place for export
/ import tools. However, it sounds solvable with a few links.

Sorry for the lengthy essay - I'm a bit passionate about connectors
and want to see CopyCat off to a great start in life :)

(BTW. I think Apache is a great place for CopyCat. I'll be happy 

[jira] [Commented] (KAFKA-2235) LogCleaner offset map overflow

2015-06-22 Thread Ivan Simoneko (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595798#comment-14595798
 ] 

Ivan Simoneko commented on KAFKA-2235:
--

[~junrao] thank you for the review. Please check patch v2. I think in most 
cases mentioning log.cleaner.dedupe.buffer.size should be enough, but since 
log.cleaner.threads is also used in determining the map size I've added both of 
them. If someone increases the number of threads and starts getting this 
message, they can easily understand the cause of the problem.
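A hedged illustration of the kind of actionable failure message described above
(hypothetical names, not the actual patch):
```java
// Hedged sketch: fail fast with a message that names both configs that
// determine the per-thread dedupe map size.
class CleanerPrecondition {
    static void checkCapacity(long mapEntries, long maxAllowedEntries) {
        if (mapEntries >= maxAllowedEntries)
            throw new IllegalStateException(
                "Offset map is full; consider increasing " +
                "log.cleaner.dedupe.buffer.size or decreasing log.cleaner.threads");
    }
}
```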

 LogCleaner offset map overflow
 --

 Key: KAFKA-2235
 URL: https://issues.apache.org/jira/browse/KAFKA-2235
 Project: Kafka
  Issue Type: Bug
  Components: core, log
Affects Versions: 0.8.1, 0.8.2.0
Reporter: Ivan Simoneko
Assignee: Jay Kreps
 Attachments: KAFKA-2235_v1.patch, KAFKA-2235_v2.patch


 We've seen log cleaning generate an error for a topic with lots of small 
 messages. It seems that a cleanup map overflow is possible if a log segment 
 contains more unique keys than there are empty slots in the offsetMap. Checking 
 baseOffset and map utilization before processing a segment is not enough because 
 it doesn't take into account the segment size (the number of unique messages in 
 the segment).
 I suggest estimating the upper bound on the number of keys in a segment as the 
 number of messages in the segment and comparing it with the number of available 
 slots in the map (keeping the desired load factor in mind). This works in cases 
 where an empty map is capable of holding all the keys of a single segment. If 
 even a single segment is unable to fit into an empty map, the cleanup process 
 will still fail. Perhaps there should be a limit on the log segment entry count?
 Here is the stack trace for this error:
 2015-05-19 16:52:48,758 ERROR [kafka-log-cleaner-thread-0] 
 kafka.log.LogCleaner - [kafka-log-cleaner-thread-0], Error due to
 java.lang.IllegalArgumentException: requirement failed: Attempt to add a new 
 entry to a full offset map.
at scala.Predef$.require(Predef.scala:233)
at kafka.log.SkimpyOffsetMap.put(OffsetMap.scala:79)
at 
 kafka.log.Cleaner$$anonfun$kafka$log$Cleaner$$buildOffsetMapForSegment$1.apply(LogCleaner.scala:543)
at 
 kafka.log.Cleaner$$anonfun$kafka$log$Cleaner$$buildOffsetMapForSegment$1.apply(LogCleaner.scala:538)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at kafka.utils.IteratorTemplate.foreach(IteratorTemplate.scala:32)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at kafka.message.MessageSet.foreach(MessageSet.scala:67)
at 
 kafka.log.Cleaner.kafka$log$Cleaner$$buildOffsetMapForSegment(LogCleaner.scala:538)
at 
 kafka.log.Cleaner$$anonfun$buildOffsetMap$3.apply(LogCleaner.scala:515)
at 
 kafka.log.Cleaner$$anonfun$buildOffsetMap$3.apply(LogCleaner.scala:512)
at scala.collection.immutable.Stream.foreach(Stream.scala:547)
at kafka.log.Cleaner.buildOffsetMap(LogCleaner.scala:512)
at kafka.log.Cleaner.clean(LogCleaner.scala:307)
at 
 kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:221)
at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:199)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (KAFKA-2290) OffsetIndex should open RandomAccessFile consistently

2015-06-22 Thread Chris Black (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Black reassigned KAFKA-2290:
--

Assignee: Chris Black

 OffsetIndex should open RandomAccessFile consistently
 -

 Key: KAFKA-2290
 URL: https://issues.apache.org/jira/browse/KAFKA-2290
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 0.9.0
Reporter: Jun Rao
Assignee: Chris Black
  Labels: newbie

 We open RandomAccessFile in "rw" mode in the constructor, but in "rws" mode 
 in resize(). We should use "rw" in both cases since it's more efficient.
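 For context, a hedged sketch of the difference (not the actual OffsetIndex 
 code): "rws" forces every update of file content and metadata to be written 
 synchronously to the device, while "rw" defers flushing to an explicit sync.
```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hedged sketch, not the actual OffsetIndex code: open in "rw" everywhere and
// flush explicitly through the channel when durability is required.
class IndexFileOpen {
    static RandomAccessFile open(File file) throws IOException {
        return new RandomAccessFile(file, "rw");
    }

    static void flush(RandomAccessFile raf) throws IOException {
        raf.getChannel().force(true); // true also forces file metadata
    }
}
```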



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2092) New partitioning for better load balancing

2015-06-22 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14595510#comment-14595510
 ] 

Gianmarco De Francisci Morales commented on KAFKA-2092:
---

Any more thoughts on this?

 New partitioning for better load balancing
 --

 Key: KAFKA-2092
 URL: https://issues.apache.org/jira/browse/KAFKA-2092
 Project: Kafka
  Issue Type: Improvement
  Components: producer 
Reporter: Gianmarco De Francisci Morales
Assignee: Jun Rao
 Attachments: KAFKA-2092-v1.patch


 We have recently studied the problem of load balancing in distributed stream 
 processing systems such as Samza [1].
 In particular, we focused on what happens when the key distribution of the 
 stream is skewed when using key grouping.
 We developed a new stream partitioning scheme (which we call Partial Key 
 Grouping). It achieves better load balancing than hashing while being more 
 scalable than round robin in terms of memory.
 In the paper we show a number of mining algorithms that are easy to implement 
 with partial key grouping, and whose performance can benefit from it. We 
 think that it might also be useful for a larger class of algorithms.
 PKG has already been integrated in Storm [2], and I would like to be able to 
 use it in Samza as well. As far as I understand, Kafka producers are the ones 
 that decide how to partition the stream (or Kafka topic).
 I do not have experience with Kafka; however, partial key grouping is very 
 easy to implement: it requires just a few lines of code in Java when 
 implemented as a custom grouping in Storm [3].
 I believe it should be very easy to integrate.
 For all these reasons, I believe it will be a nice addition to Kafka/Samza. 
 If the community thinks it's a good idea, I will be happy to offer support in 
 the porting.
 References:
 [1] 
 https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
 [2] https://issues.apache.org/jira/browse/STORM-632
 [3] https://github.com/gdfm/partial-key-grouping
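 To make the "few lines of code" concrete, here is a minimal, framework-agnostic 
 sketch of partial key grouping in Java (all names hypothetical): the producer 
 keeps a local count of messages sent per partition and routes each key to the 
 less loaded of its two candidate partitions.
```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hedged sketch of Partial Key Grouping, not tied to any producer API: each
// key maps to two candidate partitions via two hash seeds, and each message
// goes to whichever candidate has received fewer messages so far.
class PartialKeyGrouper {
    private final AtomicLongArray sent; // local per-partition send counts
    private final int numPartitions;

    PartialKeyGrouper(int numPartitions) {
        this.numPartitions = numPartitions;
        this.sent = new AtomicLongArray(numPartitions);
    }

    int partition(byte[] key) {
        int p1 = (hash(key, 0x9747b28c) & 0x7fffffff) % numPartitions;
        int p2 = (hash(key, 0x7ed55d16) & 0x7fffffff) % numPartitions;
        int chosen = sent.get(p1) <= sent.get(p2) ? p1 : p2;
        sent.incrementAndGet(chosen);
        return chosen;
    }

    private static int hash(byte[] key, int seed) {
        int h = seed;
        for (byte b : key) h = 31 * h + b;
        return h;
    }
}
```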



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-2092) New partitioning for better load balancing

2015-06-22 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589634#comment-14589634
 ] 

Gianmarco De Francisci Morales edited comment on KAFKA-2092 at 6/22/15 8:42 AM:


Thanks for your comment [~jkreps].
Indeed, this uses the load estimated at the producer to infer the load at the 
consumer. You might think this does not work but indeed it does in most cases 
(see [1] for details). I am not sure whether the lifecycle of the producer has 
any impact here. The goal is simply to send balanced partitions out of the 
producer.

Regarding the key=>partition mapping, yes, this breaks the 1 key to 1 partition 
mapping. That's exactly the point: to offer a new primitive for stream 
partitioning. If you are doing word count you need a final aggregator as you 
say, but the aggregation is O(1) rather than O(W) [where W is the number of 
workers, i.e., the parallelism of the operator]. Also, if you imagine building 
views out of these partitions, you can query 2 views rather than 1 to obtain 
the final answer (again, compared to shuffle grouping where you need W queries).

I disagree with your last point (and the results do too). Given that you have 2 
options, the imbalance is reduced by much more than a factor of 2, because you 
create options to offload part of the load on a heavy partition to the second 
choice, thus creating a network of backup/offload options to move to when one 
key becomes hot. It's like a network of interconnected pipes into which you pump 
a fluid.

What is true is that if the single heavy key is larger than a 2/W fraction of the 
stream, then this technique cannot help you achieve perfect load balance.


was (Author: azaroth):
Thanks for your comment [~jkreps].
Indeed, this uses the load estimated at the producer to infer the load at the 
consumer. You might think this does not work but indeed it does in most cases 
(see [1] for details). I am not sure whether the lifecycle of the producer has 
any impact here. The goal is simply to send balanced partitions out of the 
producer.

Regarding the key=>partition mapping, yes, this breaks the 1 key to 1 partition 
mapping. That's exactly the point: to offer a new primitive for stream 
partitioning. If you are doing word count you need a final aggregator as you 
say, but the aggregation is O(1) rather than O(W) [where W is the number of 
workers, i.e., the parallelism of the operator]. Also, if you imagine building 
views out of these partitions, you can query 2 views rather than 1 to obtain 
the final answer (again, compared to shuffle grouping where you need p queries).

I disagree with your last point (and the results do too). Given that you have 2 
options, the imbalance is reduced by much more than a factor of 2, because you 
create options to offload part of the load on a heavy partition to the second 
choice, thus creating a network of backup/offload options to move to when one 
key becomes hot. It's like a network of interconnected pipes into which you pump 
a fluid.

What is true is that if the single heavy key is larger than a 2/W fraction of the 
stream, then this technique cannot help you achieve perfect load balance.

 New partitioning for better load balancing
 --

 Key: KAFKA-2092
 URL: https://issues.apache.org/jira/browse/KAFKA-2092
 Project: Kafka
  Issue Type: Improvement
  Components: producer 
Reporter: Gianmarco De Francisci Morales
Assignee: Jun Rao
 Attachments: KAFKA-2092-v1.patch


 We have recently studied the problem of load balancing in distributed stream 
 processing systems such as Samza [1].
 In particular, we focused on what happens when the key distribution of the 
 stream is skewed when using key grouping.
 We developed a new stream partitioning scheme (which we call Partial Key 
 Grouping). It achieves better load balancing than hashing while being more 
 scalable than round robin in terms of memory.
 In the paper we show a number of mining algorithms that are easy to implement 
 with partial key grouping, and whose performance can benefit from it. We 
 think that it might also be useful for a larger class of algorithms.
 PKG has already been integrated in Storm [2], and I would like to be able to 
 use it in Samza as well. As far as I understand, Kafka producers are the ones 
 that decide how to partition the stream (or Kafka topic).
 I do not have experience with Kafka; however, partial key grouping is very 
 easy to implement: it requires just a few lines of code in Java when 
 implemented as a custom grouping in Storm [3].
 I believe it should be very easy to integrate.
 For all these reasons, I believe it will be a nice addition to Kafka/Samza. 
 If the community thinks it's a good idea, I will be happy to offer support in 
 the porting.
 

[jira] [Updated] (KAFKA-2290) OffsetIndex should open RandomAccessFile consistently

2015-06-22 Thread Chris Black (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Black updated KAFKA-2290:
---
Attachment: KAFKA-2290.patch

 OffsetIndex should open RandomAccessFile consistently
 -

 Key: KAFKA-2290
 URL: https://issues.apache.org/jira/browse/KAFKA-2290
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 0.9.0
Reporter: Jun Rao
Assignee: Chris Black
  Labels: newbie
 Attachments: KAFKA-2290.patch


 We open RandomAccessFile in "rw" mode in the constructor, but in "rws" mode 
 in resize(). We should use "rw" in both cases since it's more efficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-2276) Initial patch for KIP-25

2015-06-22 Thread Geoffrey Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14596624#comment-14596624
 ] 

Geoffrey Anderson edited comment on KAFKA-2276 at 6/22/15 9:00 PM:
---

Pull request is here: https://github.com/apache/kafka/pull/70

More info:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-25+-+System+test+improvements
https://cwiki.apache.org/confluence/display/KAFKA/Roadmap+-+port+existing+system+tests


was (Author: granders):
Pull request is here: https://github.com/apache/kafka/pull/70

 Initial patch for KIP-25
 

 Key: KAFKA-2276
 URL: https://issues.apache.org/jira/browse/KAFKA-2276
 Project: Kafka
  Issue Type: Bug
Reporter: Geoffrey Anderson
Assignee: Geoffrey Anderson

 Submit initial patch for KIP-25 
 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-25+-+System+test+improvements)
 This patch should contain a few Service classes and a few tests which can 
 serve as examples 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2294) javadoc compile error due to illegal <p/>, build failing (jdk 8)

2015-06-22 Thread Jeremy Fields (JIRA)
Jeremy Fields created KAFKA-2294:


 Summary: javadoc compile error due to illegal <p/>, build failing 
(jdk 8)
 Key: KAFKA-2294
 URL: https://issues.apache.org/jira/browse/KAFKA-2294
 Project: Kafka
  Issue Type: Bug
Reporter: Jeremy Fields


Quick one,

kafka/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java:525:
 error: self-closing element not allowed
 * <p/>

This is causing the build to fail under Java 8 due to strict HTML checking.

Replace that <p/> with <p>
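
For illustration, a form that JDK 8's doclint accepts (the rejected form
differs only in the self-closing tag):
```java
class JavadocExample {
    /**
     * Sends a record.
     * <p>
     * Blocks until the send completes.
     */
    void send() { }
}
// JDK 8's doclint rejects the previous self-closing form, "<p/>", with
// "error: self-closing element not allowed"; plain "<p>" is accepted.
```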

Regards,



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [DISCUSS] KIP-26 - Add Copycat, a connector framework for data import/export

2015-06-22 Thread Roshan Naik
Thanks Jay and Ewen for the response.


@Jay

 3. This has a built in notion of parallelism throughout.



It was not obvious how it would look or differ from existing systems,
since all of the existing ones do parallelize data movement.


@Ewen,

Import: Flume is just one of many similar systems designed around log
collection. See notes below, but one major point is that they generally
don't provide any sort of guaranteed delivery semantics.


I think most of them do provide guarantees of some sort (e.g. Flume and 
FluentD). 


YARN: My point isn't that YARN is bad, it's that tying to any particular
cluster manager severely limits the applicability of the tool. The goal is
to make Copycat agnostic to the cluster manager so it can run under Mesos,
YARN, etc.

OK, got it. It sounds like there is a plan to do some work here to ensure that
out of the box it works with more than one scheduler (as @Jay listed).
In that case, IMO it would be better to rephrase the KIP to say
that it will support more than one scheduler.


Exactly once: You accomplish this in any system by managing offsets in the
destination system atomically with the data or through some kind of
deduplication. Jiangjie actually just gave a great talk about this issue
at
a recent Kafka meetup, perhaps he can share some slides about it. When you
see all the details involved, you'll see why I think it might be nice to
have the framework help you manage the complexities of achieving different
delivery semantics ;)


Deduplication as a post-processing step is a common recommendation today,
but that is a workaround for the delivery systems' inability to provide
exactly-once semantics. IMO such post-processing should not
be considered part of Copycat's exactly-once guarantee.


It would be good to know how this guarantee will be possible when delivering
to HDFS.
It would be great if someone could share those slides if this is discussed there.
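
For a transactional sink, one commonly described approach looks roughly like the
hedged sketch below (hypothetical schema, JDBC purely for illustration): commit
the data and the consumed offset in a single transaction, so that after a crash
the connector resumes from the offset recorded in the destination and replays
produce no duplicates. For HDFS, the analogous trick is usually described as
writing to a temporary file and atomically renaming it, with the covered offsets
recoverable from the file name.
```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hedged sketch (hypothetical schema): the record and the offset commit
// atomically, so after a failure either both are visible or neither is.
class TransactionalSink {
    void deliver(String jdbcUrl, String topic, int partition,
                 long offset, byte[] key, byte[] value) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false);
            try (PreparedStatement data = conn.prepareStatement(
                     "INSERT INTO events(k, v) VALUES (?, ?)");
                 PreparedStatement off = conn.prepareStatement(
                     "UPDATE offsets SET next_offset = ? WHERE topic = ? AND part = ?")) {
                data.setBytes(1, key);
                data.setBytes(2, value);
                data.executeUpdate();
                off.setLong(1, offset + 1);
                off.setString(2, topic);
                off.setInt(3, partition);
                off.executeUpdate();
                conn.commit(); // both changes become visible together
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}
```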




I was looking for clarification on this:
- Export side: is this like a map-reduce kind of job or something else?
If delivering to HDFS, would this be running on the Hadoop cluster or
outside?
- Import side: how does this look? Is it a bunch of Flume-like processes?
Maybe just some kind of broker that translates the incoming protocol
into the outgoing Kafka producer API protocol? If delivering to HDFS, will
this run on the cluster or outside?


I still think adding one or two specific end-to-end use cases to the KIP,
showing how Copycat will pan out for them for import/export, will really
clarify things.





Re: Review Request 35655: Patch for KAFKA-2271

2015-06-22 Thread Ewen Cheslack-Postava


 On June 22, 2015, 11:43 p.m., Ewen Cheslack-Postava wrote:
  LGTM. There were 9 question marks when 10 characters were requested, so the 
  problem was probably just that a whitespace character at the start or end 
  would get trimmed during AbstractConfig's parsing.
 
 Jason Gustafson wrote:
 That's what I thought as well, but I was puzzled that I couldn't 
 reproduce it. In fact, it looks like the issue was fixed with KAFKA-2249, 
 which preserves the original properties that were used to construct the 
 config. In that case, however, the assertion basically becomes a tautology, 
 so perhaps we should just remove the test case?

That makes sense. I agree that this test doesn't seem all that useful anymore. 
I think all the old config classes had tests since they were all written 
manually. Since the configs are now defined more declaratively via 
AbstractConfig, these types of tests seem a lot less relevant. It doesn't look 
like we've generated tests for any other uses of AbstractConfig. Even other 
tests in that same class are just sort of redundant, e.g. testFromPropsInvalid. 
It is, I suppose, checking that the type is set correctly in the ConfigDef, but 
mostly it's just re-testing the common parsing functionality repeatedly.
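
For reference, the declarative style being discussed, as a hedged sketch (a
single hypothetical entry; ConfigDef enforces type, default, and range
centrally, which is why hand-written per-config parsing tests add little):
```java
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigDef.Importance;
import org.apache.kafka.common.config.ConfigDef.Type;

// Hedged sketch: the type, default value, and validator are declared once,
// so the shared parser validates every config the same way.
class ExampleConfigDef {
    static final ConfigDef CONFIG = new ConfigDef()
        .define("fetch.min.bytes", Type.INT, 1, ConfigDef.Range.atLeast(0),
                Importance.HIGH, "Minimum bytes of data the server should return.");
}
```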


- Ewen


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35655/#review88877
---


On June 19, 2015, 4:48 p.m., Jason Gustafson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/35655/
 ---
 
 (Updated June 19, 2015, 4:48 p.m.)
 
 
 Review request for kafka.
 
 
 Bugs: KAFKA-2271
 https://issues.apache.org/jira/browse/KAFKA-2271
 
 
 Repository: kafka
 
 
 Description
 ---
 
 KAFKA-2271; fix minor test bugs
 
 
 Diffs
 -
 
   core/src/test/scala/unit/kafka/server/KafkaConfigConfigDefTest.scala 
 98a5b042a710d3c1064b0379db1d152efc9eabee 
 
 Diff: https://reviews.apache.org/r/35655/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Jason Gustafson
 




Re: Review Request 34789: Patch for KAFKA-2168

2015-06-22 Thread Jun Rao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34789/#review88914
---

Ship it!


Thanks for the latest patch. Looks good overall. To avoid holding on to this 
relatively large patch for too long, I have committed the latest patch to trunk. 
There are a few minor comments below and we can commit any necessary fixes in a 
follow-up patch.


clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java 
(lines 335 - 338)
https://reviews.apache.org/r/34789/#comment141530

These seem redundant given the code below.



clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java 
(line 420)
https://reviews.apache.org/r/34789/#comment141531

Should this be volatile so that different threads can see the latest value 
of refcount?
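
As a hedged illustration of the visibility concern (not the actual
KafkaConsumer code): with a plain int field, one thread's updates to the
refcount may never become visible to another; an AtomicInteger (or a
volatile int) fixes that.
```java
import java.util.concurrent.atomic.AtomicInteger;

// Hedged sketch, not the actual KafkaConsumer code: AtomicInteger guarantees
// that acquire/release calls from different threads see each other's updates.
class RefCounted {
    private final AtomicInteger refcount = new AtomicInteger(0);

    void acquire() { refcount.incrementAndGet(); }

    void release() {
        if (refcount.decrementAndGet() == 0)
            close();
    }

    private void close() { /* release underlying resources */ }
}
```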



clients/src/main/java/org/apache/kafka/clients/consumer/internals/Coordinator.java
 (line 319)
https://reviews.apache.org/r/34789/#comment141529

What's the logic for initiating a connection to the coordinator if the 
coordinator is not available during a heartbeat?


- Jun Rao


On June 22, 2015, 11:35 p.m., Jason Gustafson wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/34789/
 ---
 
 (Updated June 22, 2015, 11:35 p.m.)
 
 
 Review request for kafka.
 
 
 Bugs: KAFKA-2168
 https://issues.apache.org/jira/browse/KAFKA-2168
 
 
 Repository: kafka
 
 
 Description
 ---
 
 KAFKA-2168; refactored callback handling to prevent unnecessary requests
 
 
 KAFKA-2168; address review comments
 
 
 KAFKA-2168; fix rebase error and checkstyle issue
 
 
 KAFKA-2168; address review comments and add docs
 
 
 KAFKA-2168; handle polling with timeout 0
 
 
 KAFKA-2168; timeout=0 means return immediately
 
 
 KAFKA-2168; address review comments
 
 
 KAFKA-2168; address more review comments
 
 
 KAFKA-2168; updated for review comments
 
 
 KAFKA-2168; add serialVersionUID to ConsumerWakeupException
 
 
 Diffs
 -
 
   clients/src/main/java/org/apache/kafka/clients/consumer/Consumer.java 
 8f587bc0705b65b3ef37c86e0c25bb43ab8803de 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerRecords.java 
 1ca75f83d3667f7d01da1ae2fd9488fb79562364 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerWakeupException.java
  PRE-CREATION 
   clients/src/main/java/org/apache/kafka/clients/consumer/KafkaConsumer.java 
 951c34c92710fc4b38d656e99d2a41255c60aeb7 
   clients/src/main/java/org/apache/kafka/clients/consumer/MockConsumer.java 
 f50da825756938c193d7f07bee953e000e2627d9 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/OffsetResetStrategy.java
  PRE-CREATION 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/internals/Coordinator.java
  41cb9458f51875ac9418fce52f264b35adba92f4 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java
  56281ee15cc33dfc96ff64d5b1e596047c7132a4 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/internals/Heartbeat.java
  e7cfaaad296fa6e325026a5eee1aaf9b9c0fe1fe 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/internals/RequestFuture.java
  PRE-CREATION 
   
 clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java
  cee75410127dd1b86c1156563003216d93a086b3 
   clients/src/main/java/org/apache/kafka/common/utils/Utils.java 
 f73eedb030987f018d8446bb1dcd98d19fa97331 
   
 clients/src/test/java/org/apache/kafka/clients/consumer/MockConsumerTest.java 
 677edd385f35d4262342b567262c0b874876d25b 
   
 clients/src/test/java/org/apache/kafka/clients/consumer/internals/CoordinatorTest.java
  1454ab73df22cce028f41f74b970628829da4e9d 
   
 clients/src/test/java/org/apache/kafka/clients/consumer/internals/FetcherTest.java
  419541011d652becf0cda7a5e62ce813cddb1732 
   
 clients/src/test/java/org/apache/kafka/clients/consumer/internals/HeartbeatTest.java
  ecc78cedf59a994fcf084fa7a458fe9ed5386b00 
   
 clients/src/test/java/org/apache/kafka/clients/consumer/internals/SubscriptionStateTest.java
  e000cf8e10ebfacd6c9ee68d7b88ff8c157f73c6 
   clients/src/test/java/org/apache/kafka/common/utils/UtilsTest.java 
 2ebe3c21f611dc133a2dbb8c7dfb0845f8c21498 
 
 Diff: https://reviews.apache.org/r/34789/diff/
 
 
 Testing
 ---
 
 
 Thanks,
 
 Jason Gustafson
 




Re: [GitHub] kafka pull request: Kafka 2276

2015-06-22 Thread Gwen Shapira
Thanks, I indeed missed the original :)

Is the plan to squash the commits and merge the pull request as a single
commit that matches the JIRA #?
That would be more in line with how commits have been organized until now
and would make life much easier when cherry-picking.

Gwen

On Mon, Jun 22, 2015 at 1:58 PM, Geoffrey Anderson ge...@confluent.io wrote:
 Hi,

 I'm pinging the dev list regarding KAFKA-2276 (KIP-25 initial patch) again
 since it sounds like at least one person I spoke with did not see the
 initial pull request.

 Pull request: https://github.com/apache/kafka/pull/70/
 JIRA: https://issues.apache.org/jira/browse/KAFKA-2276

 Thanks!
 Geoff


 On Tue, Jun 16, 2015 at 2:50 PM, granders g...@git.apache.org wrote:

 GitHub user granders opened a pull request:

 https://github.com/apache/kafka/pull/70

 Kafka 2276

 Initial patch for KIP-25

 Note: to install ducktape, do *not* use pip. Instead:

 ```
 $ git clone g...@github.com:confluentinc/ducktape.git
 $ cd ducktape
 $ python setup.py install
 ```


 You can merge this pull request into a Git repository by running:

 $ git pull https://github.com/confluentinc/kafka KAFKA-2276

 Alternatively you can review and apply these changes as the patch at:

 https://github.com/apache/kafka/pull/70.patch

 To close this pull request, make a commit to your master/trunk branch
 with (at least) the following in the commit message:

 This closes #70

 
 commit 81e41562f3836e95e89e12f215c82b1b2d505381
 Author: Liquan Pei liquan...@gmail.com
 Date:   2015-04-24T01:32:54Z

 Bootstrap Kafka system tests

 commit f1914c3ba9b52d0f8db3989c8b031127b42ac59e
 Author: Liquan Pei liquan...@gmail.com
 Date:   2015-04-24T01:33:44Z

 Merge pull request #2 from confluentinc/system_tests

 Bootstrap Kafka system tests

 commit a2789885806f98dcd1fd58edc9a10a30e4bd314c
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-05-26T22:21:23Z

 fixed typos

 commit 07cd1c66a952ee29fc3c8e85464acb43a6981b8a
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-05-26T22:22:14Z

 Added simple producer which prints status of produced messages to
 stdout.

 commit da94b8cbe79e6634cc32fbe8f6deb25388923029
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-05-27T21:07:20Z

 Added number of messages option.

 commit 212b39a2d75027299fbb1b1008d463a82aab
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-05-27T22:35:06Z

 Added some metadata to producer output.

 commit 8b4b1f2aa9681632ef65aa92dfd3066cd7d62851
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-05-29T23:38:32Z

 Minor updates to VerboseProducer

 commit c0526fe44cea739519a0889ebe9ead01b406b365
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-01T02:27:15Z

 Updates per review comments.

 commit bc009f218e00241cbdd23931d01b52c442eef6b7
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-01T02:28:28Z

 Got rid of VerboseProducer in core (moved to clients)

 commit 475423bb642ac8f816e8080f891867a6362c17fa
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-01T04:05:09Z

 Convert class to string before adding to json object.

 commit 0a5de8e0590e3a8dce1a91769ad41497b5e07d17
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-02T22:46:52Z

 Fixed checkstyle errors. Changed name to VerifiableProducer. Added
 synchronization for thread safety on println statements.

 commit 9100417ce0717a71c822c5a279fe7858bfe7a7ee
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-03T19:50:11Z

 Updated command-line options for VerifiableProducer. Extracted
 throughput logic to make it reusable.

 commit 1228eefc4e52b58c214b3ad45feab36a475d5a66
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-04T01:09:14Z

 Renamed throttler

 commit 6842ed1ffad62a84df67a0f0b6a651a6df085d12
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-04T01:12:11Z

 left out a file from last commit

 commit d586fb0eb63409807c02f280fae786cec55fb348
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-04T01:22:34Z

 Updated comments to reflect that throttler is not message-specific

 commit a80a4282ba9a288edba7cdf409d31f01ebf3d458
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-04T20:47:21Z

 Added shell program for VerifiableProducer.

 commit 51a94fd6ece926bcdd864af353efcf4c4d1b8ad8
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-04T20:55:02Z

 Use argparse4j instead of joptsimple. ThroughputThrottler now has more
 intuitive behavior when targetThroughput is 0.

 commit 632be12d2384bfd1ed3b057913dfd363cab71726
 Author: Geoff grand...@gmail.com
 Date:   2015-06-04T22:22:44Z

 Merge pull request #3 from confluentinc/verbose-client

 Verbose client

 commit fc7c81c1f6cce497c19da34f7c452ee44800ab6d
 Author: Geoff Anderson ge...@confluent.io
 Date:   2015-06-11T01:01:39Z

 added setup.py

 commit