num.partitions sets the default number of partitions for a topic on a broker; topic.partition.count.map is a per-topic override of that default. For more information on configuration, please see http://incubator.apache.org/kafka/configuration.html
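As a rough sketch of what this could look like in a broker's server.properties (the topic names and counts below are purely illustrative, and the configuration page linked above has the exact syntax for the override map):

  # default number of partitions per topic on this broker
  num.partitions=2
  # per-topic override, e.g. give high-volume topics more partitions
  topic.partition.count.map=search_logs:8,click_logs:4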
Thanks,
Neha

On Sun, Nov 6, 2011 at 11:45 AM, Mark <static.void....@gmail.com> wrote:
> Ok got it. How are partitions determined? Is this something that the
> producer is responsible for, or can it be handled automatically by the
> broker?
>
> On 11/6/11 11:13 AM, Neha Narkhede wrote:
>>> Ok so the partitioning is done on the hadoop side during importing and
>>> has nothing to do with Kafka partitions.
>>
>> That's right.
>>
>> Kafka partitions help scale consumption by allowing multiple consumer
>> processes to pull data for a topic in parallel. The parallelism factor is
>> limited by the total number of Kafka partitions. For example, if a
>> topic has 2 partitions, 2 Hadoop mappers can pull data for the entire
>> topic in parallel. If another topic has 8 partitions, the parallelism
>> factor increases by 4x: now 8 mappers can pull all the data for this
>> topic at the same time.
>>
>> Thanks,
>> Neha
>>
>> On Sun, Nov 6, 2011 at 11:00 AM, Mark <static.void....@gmail.com> wrote:
>>> Ok so the partitioning is done on the hadoop side during importing and
>>> has nothing to do with Kafka partitions. Would you mind explaining what
>>> Kafka partitions are used for and when one should use them?
>>>
>>> On 11/6/11 10:52 AM, Neha Narkhede wrote:
>>>> We use Avro serialization for the message data and use Avro schemas to
>>>> convert event objects into Kafka message payload on the producers. On
>>>> the Hadoop side, we use Avro schemas to deserialize the Kafka message
>>>> payload back into an event object. Each such event object has a
>>>> timestamp field that the Hadoop job uses to put the message into its
>>>> hourly and daily partition. So if the Hadoop job runs every 15 mins,
>>>> it will run 4 times to collect data into the current hour's partition.
>>>>
>>>> Very soon, the plan is to open-source this Avro-Hadoop pipeline as well.
>>>>
>>>> Thanks,
>>>> Neha
>>>>
>>>> On Sun, Nov 6, 2011 at 10:37 AM, Mark <static.void....@gmail.com> wrote:
>>>>> "At LinkedIn we use the InputFormat provided in contrib/hadoop-consumer
>>>>> to load the data for topics in daily and hourly partitions."
>>>>>
>>>>> Sorry for my ignorance, but what exactly do you mean by loading the
>>>>> data in daily and hourly partitions?
>>>>>
>>>>> On 11/6/11 10:26 AM, Neha Narkhede wrote:
>>>>>> There should be no changes to the way you create topics to achieve
>>>>>> this kind of HDFS data load for Kafka. At LinkedIn we use the
>>>>>> InputFormat provided in contrib/hadoop-consumer to load the data for
>>>>>> topics in daily and hourly partitions. These Hadoop jobs run every 10
>>>>>> mins or so, so the maximum delay of data being available from
>>>>>> producer->Hadoop is around 10 mins.
>>>>>>
>>>>>> Thanks,
>>>>>> Neha
>>>>>>
>>>>>> On Sun, Nov 6, 2011 at 8:45 AM, Mark <static.void....@gmail.com> wrote:
>>>>>>> This is more of a general design question, but what is the preferred
>>>>>>> way of importing logs from Kafka to HDFS when you want your data
>>>>>>> segmented by hour or day? Is there any way to say "Import only this
>>>>>>> {hour|day} of logs", or does one need to create topics around the way
>>>>>>> they would like to import them, i.e. Topic: "search_logs/2011/11/06"?
>>>>>>> If it's the latter, is there any documentation/best practices on
>>>>>>> topic/key design?
>>>>>>>
>>>>>>> Thanks
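[Editor's note] For readers following the thread, below is a minimal sketch of the idea Neha describes above: the Hadoop job reads the event's timestamp field (deserialized from the Avro payload) and uses it to pick the hourly and daily partition the record lands in. This is not LinkedIn's actual Avro-Hadoop pipeline; the class name, path layout, and topic name are illustrative assumptions.

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch: map an event timestamp to the hourly/daily HDFS partition paths
// the import job would write to. Path layout is illustrative only.
public class PartitionPathSketch {

    private static final SimpleDateFormat DAILY  = new SimpleDateFormat("yyyy/MM/dd");
    private static final SimpleDateFormat HOURLY = new SimpleDateFormat("yyyy/MM/dd/HH");

    static {
        DAILY.setTimeZone(TimeZone.getTimeZone("UTC"));
        HOURLY.setTimeZone(TimeZone.getTimeZone("UTC"));
    }

    // e.g. dailyPath("search_logs", ts)  -> /data/search_logs/daily/2011/11/06
    public static String dailyPath(String topic, long timestampMs) {
        return "/data/" + topic + "/daily/" + DAILY.format(new Date(timestampMs));
    }

    // e.g. hourlyPath("search_logs", ts) -> /data/search_logs/hourly/2011/11/06/18
    public static String hourlyPath(String topic, long timestampMs) {
        return "/data/" + topic + "/hourly/" + HOURLY.format(new Date(timestampMs));
    }

    public static void main(String[] args) {
        long ts = 1320604200000L; // Sun, 06 Nov 2011 18:30:00 UTC (example value)
        System.out.println(dailyPath("search_logs", ts));
        System.out.println(hourlyPath("search_logs", ts));
    }
}

Because the partition is derived from the event's own timestamp rather than from the topic name, the job can run every 10-15 minutes and keep appending to the current hour's partition, as described in the thread.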