num.partitions sets the default number of partitions for a topic on a broker; topic.partition.count.map is a per-topic override of that default. For more information on configuration, please see http://incubator.apache.org/kafka/configuration.html
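As a rough sketch of what this could look like in a broker's server.properties (the topic names and counts below are purely illustrative, and the configuration page linked above has the exact syntax for the override map):

  # default number of partitions per topic on this broker
  num.partitions=2
  # per-topic override, e.g. give high-volume topics more partitions
  topic.partition.count.map=search_logs:8,click_logs:4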
Thanks,
Neha

On Sun, Nov 6, 2011 at 11:45 AM, Mark <static.void....@gmail.com> wrote:
> Ok got it. How are partitions determined? Is this something that the
> producer is responsible for, or can it be handled automatically by the
> broker?
>
> On 11/6/11 11:13 AM, Neha Narkhede wrote:
>>> Ok so the partitioning is done on the hadoop side during importing and
>>> has nothing to do with Kafka partitions.
>>
>> That's right.
>>
>> Kafka partitions help scale consumption by allowing multiple consumer
>> processes to pull data for a topic in parallel. The parallelism factor is
>> limited by the total number of Kafka partitions. For example, if a
>> topic has 2 partitions, 2 Hadoop mappers can pull data for the entire
>> topic in parallel. If another topic has 8 partitions, the parallelism
>> factor increases by 4x: now 8 mappers can pull all the data for this
>> topic at the same time.
>>
>> Thanks,
>> Neha
>>
>> On Sun, Nov 6, 2011 at 11:00 AM, Mark <static.void....@gmail.com> wrote:
>>> Ok so the partitioning is done on the hadoop side during importing and
>>> has nothing to do with Kafka partitions. Would you mind explaining what
>>> Kafka partitions are used for and when one should use them?
>>>
>>> On 11/6/11 10:52 AM, Neha Narkhede wrote:
>>>> We use Avro serialization for the message data and use Avro schemas to
>>>> convert event objects into Kafka message payload on the producers. On
>>>> the Hadoop side, we use Avro schemas to deserialize the Kafka message
>>>> payload back into an event object. Each such event object has a
>>>> timestamp field that the Hadoop job uses to put the message into its
>>>> hourly and daily partition. So if the Hadoop job runs every 15 mins,
>>>> it will run 4 times to collect data into the current hour's partition.
>>>>
>>>> Very soon, the plan is to open-source this Avro-Hadoop pipeline as well.
>>>>
>>>> Thanks,
>>>> Neha
>>>>
>>>> On Sun, Nov 6, 2011 at 10:37 AM, Mark <static.void....@gmail.com> wrote:
>>>>> "At LinkedIn we use the InputFormat provided in contrib/hadoop-consumer
>>>>> to load the data for topics in daily and hourly partitions."
>>>>>
>>>>> Sorry for my ignorance, but what exactly do you mean by loading the
>>>>> data in daily and hourly partitions?
>>>>>
>>>>> On 11/6/11 10:26 AM, Neha Narkhede wrote:
>>>>>> There should be no changes to the way you create topics to achieve
>>>>>> this kind of HDFS data load for Kafka. At LinkedIn we use the
>>>>>> InputFormat provided in contrib/hadoop-consumer to load the data for
>>>>>> topics in daily and hourly partitions. These Hadoop jobs run every 10
>>>>>> mins or so, so the maximum delay of data being available from
>>>>>> producer->Hadoop is around 10 mins.
>>>>>>
>>>>>> Thanks,
>>>>>> Neha
>>>>>>
>>>>>> On Sun, Nov 6, 2011 at 8:45 AM, Mark <static.void....@gmail.com> wrote:
>>>>>>> This is more of a general design question, but what is the preferred
>>>>>>> way of importing logs from Kafka to HDFS when you want your data
>>>>>>> segmented by hour or day? Is there any way to say "Import only this
>>>>>>> {hour|day} of logs", or does one need to create topics around the way
>>>>>>> they would like to import them, i.e. Topic: "search_logs/2011/11/06"?
>>>>>>> If it's the latter, is there any documentation/best practices on
>>>>>>> topic/key design?
>>>>>>>
>>>>>>> Thanks
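[Editor's note] For readers following the thread, below is a minimal sketch of the idea Neha describes above: the Hadoop job reads the event's timestamp field (deserialized from the Avro payload) and uses it to pick the hourly and daily partition the record lands in. This is not LinkedIn's actual Avro-Hadoop pipeline; the class name, path layout, and topic name are illustrative assumptions.

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch: map an event timestamp to the hourly/daily HDFS partition paths
// the import job would write to. Path layout is illustrative only.
public class PartitionPathSketch {

    private static final SimpleDateFormat DAILY  = new SimpleDateFormat("yyyy/MM/dd");
    private static final SimpleDateFormat HOURLY = new SimpleDateFormat("yyyy/MM/dd/HH");

    static {
        DAILY.setTimeZone(TimeZone.getTimeZone("UTC"));
        HOURLY.setTimeZone(TimeZone.getTimeZone("UTC"));
    }

    // e.g. dailyPath("search_logs", ts)  -> /data/search_logs/daily/2011/11/06
    public static String dailyPath(String topic, long timestampMs) {
        return "/data/" + topic + "/daily/" + DAILY.format(new Date(timestampMs));
    }

    // e.g. hourlyPath("search_logs", ts) -> /data/search_logs/hourly/2011/11/06/18
    public static String hourlyPath(String topic, long timestampMs) {
        return "/data/" + topic + "/hourly/" + HOURLY.format(new Date(timestampMs));
    }

    public static void main(String[] args) {
        long ts = 1320604200000L; // Sun, 06 Nov 2011 18:30:00 UTC (example value)
        System.out.println(dailyPath("search_logs", ts));
        System.out.println(hourlyPath("search_logs", ts));
    }
}

Because the partition is derived from the event's own timestamp rather than from the topic name, the job can run every 10-15 minutes and keep appending to the current hour's partition, as described in the thread.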