> Ok so the partitioning is done on the hadoop side during importing and has
> nothing to do with Kafka partitions.

That's right. Kafka partitions help scale consumption by allowing multiple
consumer processes to pull data for a topic in parallel. The parallelism
factor is limited by the total number of Kafka partitions. For example, if a
topic has 2 partitions, 2 Hadoop mappers can pull data for the entire topic
in parallel. If another topic has 8 partitions, the parallelism factor
increases by 4x: now 8 mappers can pull all the data for this topic at the
same time.

Thanks,
Neha

On Sun, Nov 6, 2011 at 11:00 AM, Mark <static.void....@gmail.com> wrote:
> Ok so the partitioning is done on the hadoop side during importing and has
> nothing to do with Kafka partitions. Would you mind explaining what Kafka
> partitions are used for and when one should use them?
>
> On 11/6/11 10:52 AM, Neha Narkhede wrote:
>>
>> We use Avro serialization for the message data and use Avro schemas to
>> convert event objects into Kafka message payload on the producers. On
>> the Hadoop side, we use Avro schemas to deserialize the Kafka message
>> payload back into an event object. Each such event object has a
>> timestamp field that the Hadoop job uses to put the message into its
>> hourly and daily partition. So if the Hadoop job runs every 15 mins,
>> it will run 4 times to collect data into the current hour's partition.
>>
>> Very soon, the plan is to open-source this Avro-Hadoop pipeline as well.
>>
>> Thanks,
>> Neha
>>
>> On Sun, Nov 6, 2011 at 10:37 AM, Mark <static.void....@gmail.com> wrote:
>>>
>>> "At LinkedIn we use the InputFormat provided in contrib/hadoop-consumer
>>> to load the data for topics in daily and hourly partitions."
>>>
>>> Sorry for my ignorance, but what exactly do you mean by loading the data
>>> in daily and hourly partitions?
>>>
>>> On 11/6/11 10:26 AM, Neha Narkhede wrote:
>>>>
>>>> There should be no changes to the way you create topics to achieve
>>>> this kind of HDFS data load for Kafka. At LinkedIn we use the
>>>> InputFormat provided in contrib/hadoop-consumer to load the data for
>>>> topics in daily and hourly partitions. These Hadoop jobs run every 10
>>>> mins or so, so the maximum delay of data being available from
>>>> producer->Hadoop is around 10 mins.
>>>>
>>>> Thanks,
>>>> Neha
>>>>
>>>> On Sun, Nov 6, 2011 at 8:45 AM, Mark <static.void....@gmail.com> wrote:
>>>>>
>>>>> This is more of a general design question, but what is the preferred
>>>>> way of importing logs from Kafka to HDFS when you want your data
>>>>> segmented by hour or day? Is there any way to say "Import only this
>>>>> {hour|day} of logs", or does one need to create their topics around
>>>>> the way they would like to import them, i.e. Topic:
>>>>> "search_logs/2011/11/06"? If it's the latter, is there any
>>>>> documentation/best practices on topic/key design?
>>>>>
>>>>> Thanks
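
[Editor's note] A minimal sketch of the parallelism point at the top of the
thread: the number of mappers that can do useful work is capped by the topic's
partition count. This is plain Java and purely conceptual; it is not the Kafka
or contrib/hadoop-consumer API, and the round-robin assignment is only an
illustration.

    import java.util.ArrayList;
    import java.util.List;

    public class PartitionParallelismDemo {

        // Assign each partition of a topic to a mapper slot, round-robin.
        // Effective parallelism is min(numMappers, numPartitions): with 2
        // partitions only 2 mappers get work; with 8 partitions up to 8 do.
        static List<List<Integer>> assign(int numPartitions, int numMappers) {
            int activeMappers = Math.min(numMappers, numPartitions);
            List<List<Integer>> work = new ArrayList<List<Integer>>();
            for (int m = 0; m < numMappers; m++) {
                work.add(new ArrayList<Integer>());
            }
            for (int p = 0; p < numPartitions; p++) {
                work.get(p % activeMappers).add(p);
            }
            return work;
        }

        public static void main(String[] args) {
            System.out.println(assign(2, 8)); // only 2 of 8 mappers get partitions
            System.out.println(assign(8, 8)); // all 8 mappers pull in parallel
        }
    }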
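
[Editor's note] And a rough sketch of the Avro round trip plus timestamp-based
hourly partitioning described in the thread. The schema, field names, and path
layout (e.g. search_logs/yyyy/MM/dd/HH) are assumptions made up for
illustration, not LinkedIn's actual pipeline; it assumes a recent Avro release
with the EncoderFactory/DecoderFactory API.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    import java.io.ByteArrayOutputStream;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    public class AvroEventPartitionDemo {

        // Hypothetical event schema with the timestamp field the thread mentions.
        private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"SearchEvent\",\"fields\":["
            + "{\"name\":\"timestamp\",\"type\":\"long\"},"
            + "{\"name\":\"query\",\"type\":\"string\"}]}");

        // Producer side: turn an event object into a Kafka message payload (bytes).
        static byte[] serialize(GenericRecord event) throws Exception {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(SCHEMA).write(event, encoder);
            encoder.flush();
            return out.toByteArray();
        }

        // Hadoop side: turn the message payload back into an event object.
        static GenericRecord deserialize(byte[] payload) throws Exception {
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
            return new GenericDatumReader<GenericRecord>(SCHEMA).read(null, decoder);
        }

        // Map the event's timestamp to an hourly partition path (assumed layout).
        static String hourlyPartition(String topic, long timestampMillis) {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd/HH");
            fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
            return topic + "/" + fmt.format(new Date(timestampMillis));
        }

        public static void main(String[] args) throws Exception {
            GenericRecord event = new GenericData.Record(SCHEMA);
            event.put("timestamp", System.currentTimeMillis());
            event.put("query", "kafka hadoop consumer");

            byte[] payload = serialize(event);            // what the producer publishes
            GenericRecord decoded = deserialize(payload); // what the Hadoop job reads

            long ts = (Long) decoded.get("timestamp");
            System.out.println(hourlyPartition("search_logs", ts));
            // e.g. search_logs/2011/11/06/18
        }
    }

The key point is that the hour/day bucket comes from the event's own timestamp
at load time, not from the topic name, which is why no special topic layout
like "search_logs/2011/11/06" is needed.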