You are hitting https://issues.apache.org/jira/browse/KAFKA-278
The partitioning semantics that Kafka 0.7.x provides is sort of weak for it to be used for sticky partitioning features like you need here. The reason is that partitions hosted on a broker can go offline when a broker goes offline. This means that during this downtime, the data meant for the subsets hosted on those partitions will be sent to other subsets on the fly, or they will be dropped (depending on how you configure the partitioner). In Kafka 0.8, partitions are always available in spite of individual broker failures. These strong durability guarantees enables sticky partitioning to work as expected. Hope that helps, Neha On Mon, Oct 22, 2012 at 10:22 AM, David Harris <dhar...@avum.com> wrote: > Hi Everyone, > > I want to have particular subsets of my data sent to different partitions so > that I can have consumers across different machines (or multiple instances > of the consumers running in different threads) handle the subsets of data. > The definition of these subsets is important, meaning data of type 1 needs > to go into subset 1 etc. > > My set up is that I have kafka (0.7.1) and zookeeper running on a single > machine like described in the quick start guide. In my server.properties > file I’ve set num.partitions=4. > > I’m working on testing this all out with a simple class that passes one > character strings [a-z] as messages, and I have a > kafka.producer.Partitioner that puts a-g, h-m, n-s and t-z into separate > partitions. The issue I’m having is that when running my code for the first > time (i.e. the topic doesn't exist in kafka yet) I’m seeing that in the > “public int partition(String s, int numPartitions)” method of my Partitioner > the numPartitions is 1 the first few times it is called, then after a while > its coming as 4. In my example this is causing some w, y, z etc to be > included in the partition with a, b and c’s. If I’ve already run the code > once and I see the four folders under the /tmp/kafka-logs for my partition > everything works as expected. > > I’ve attached my test code that shows this issue. (I believe that > attachments come across, if not I can paste in the body of the email). I’m > not sure if I’m doing something wrong in the code or if I’m approaching this > problem in the wrong way. It seems that an alternative approach would be to > have a separate topic for each subset of data, and then have my producer > push to the different topics. Any advice/suggestions? > > On a related note, when reading up about this topic in the quick start guide > I see that it describes creating a ProducerData object like the following: > ProducerData<String, String> data = new ProducerData<String, > String>("test-topic", "test-key", "test-message"); > But looking at the API docs I see that the only constructor that allows me > to specify a key takes a List[v] as the third parameter: > ProducerData(topic: String, key: K, data: List[V]) > Am I missing something here? > > Thanks > David Harris >