> Ok so the partitioning is done on the hadoop side during importing and has
> nothing to do with Kafka partitions.

That's right. Kafka partitions help scale consumption by allowing multiple
consumer processes to pull data for a topic in parallel. The parallelism
factor is limited by the total number of Kafka partitions. For example, if a
topic has 2 partitions, 2 Hadoop mappers can pull data for the entire topic
in parallel. If another topic has 8 partitions, the parallelism factor
increases by 4x: now 8 mappers can pull all the data for this topic at the
same time.

Thanks,
Neha

On Sun, Nov 6, 2011 at 11:00 AM, Mark <static.void....@gmail.com> wrote:
> Ok so the partitioning is done on the hadoop side during importing and has
> nothing to do with Kafka partitions. Would you mind explaining what Kafka
> partitions are used for and when one should use them?
>
> On 11/6/11 10:52 AM, Neha Narkhede wrote:
>>
>> We use Avro serialization for the message data and use Avro schemas to
>> convert event objects into Kafka message payload on the producers. On
>> the Hadoop side, we use Avro schemas to deserialize the Kafka message
>> payload back into an event object. Each such event object has a
>> timestamp field that the Hadoop job uses to put the message into its
>> hourly and daily partition. So if the Hadoop job runs every 15 mins,
>> it will run 4 times to collect data into the current hour's partition.
>>
>> Very soon, the plan is to open-source this Avro-Hadoop pipeline as well.
>>
>> Thanks,
>> Neha
>>
>> On Sun, Nov 6, 2011 at 10:37 AM, Mark <static.void....@gmail.com> wrote:
>>>
>>> "At LinkedIn we use the InputFormat provided in contrib/hadoop-consumer
>>> to load the data for topics in daily and hourly partitions."
>>>
>>> Sorry for my ignorance, but what exactly do you mean by loading the data
>>> in daily and hourly partitions?
>>>
>>> On 11/6/11 10:26 AM, Neha Narkhede wrote:
>>>>
>>>> There should be no changes to the way you create topics to achieve
>>>> this kind of HDFS data load for Kafka. At LinkedIn we use the
>>>> InputFormat provided in contrib/hadoop-consumer to load the data for
>>>> topics in daily and hourly partitions. These Hadoop jobs run every 10
>>>> mins or so, so the maximum delay of data being available from
>>>> producer->Hadoop is around 10 mins.
>>>>
>>>> Thanks,
>>>> Neha
>>>>
>>>> On Sun, Nov 6, 2011 at 8:45 AM, Mark <static.void....@gmail.com> wrote:
>>>>>
>>>>> This is more of a general design question, but what is the preferred
>>>>> way of importing logs from Kafka to HDFS when you want your data
>>>>> segmented by hour or day? Is there any way to say "Import only this
>>>>> {hour|day} of logs", or does one need to create their topics around
>>>>> the way they would like to import them, i.e. Topic:
>>>>> "search_logs/2011/11/06"? If it's the latter, is there any
>>>>> documentation/best practices on topic/key design?
>>>>>
>>>>> Thanks
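
[Editor's note] A minimal sketch of the parallelism point at the top of the
thread: the number of mappers that can do useful work is capped by the topic's
partition count. This is plain Java and purely conceptual; it is not the Kafka
or contrib/hadoop-consumer API, and the round-robin assignment is only an
illustration.

    import java.util.ArrayList;
    import java.util.List;

    public class PartitionParallelismDemo {

        // Assign each partition of a topic to a mapper slot, round-robin.
        // Effective parallelism is min(numMappers, numPartitions): with 2
        // partitions only 2 mappers get work; with 8 partitions up to 8 do.
        static List<List<Integer>> assign(int numPartitions, int numMappers) {
            int activeMappers = Math.min(numMappers, numPartitions);
            List<List<Integer>> work = new ArrayList<List<Integer>>();
            for (int m = 0; m < numMappers; m++) {
                work.add(new ArrayList<Integer>());
            }
            for (int p = 0; p < numPartitions; p++) {
                work.get(p % activeMappers).add(p);
            }
            return work;
        }

        public static void main(String[] args) {
            System.out.println(assign(2, 8)); // only 2 of 8 mappers get partitions
            System.out.println(assign(8, 8)); // all 8 mappers pull in parallel
        }
    }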
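
[Editor's note] And a rough sketch of the Avro round trip plus timestamp-based
hourly partitioning described in the thread. The schema, field names, and path
layout (e.g. search_logs/yyyy/MM/dd/HH) are assumptions made up for
illustration, not LinkedIn's actual pipeline; it assumes a recent Avro release
with the EncoderFactory/DecoderFactory API.

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    import java.io.ByteArrayOutputStream;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    public class AvroEventPartitionDemo {

        // Hypothetical event schema with the timestamp field the thread mentions.
        private static final Schema SCHEMA = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"SearchEvent\",\"fields\":["
            + "{\"name\":\"timestamp\",\"type\":\"long\"},"
            + "{\"name\":\"query\",\"type\":\"string\"}]}");

        // Producer side: turn an event object into a Kafka message payload (bytes).
        static byte[] serialize(GenericRecord event) throws Exception {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(SCHEMA).write(event, encoder);
            encoder.flush();
            return out.toByteArray();
        }

        // Hadoop side: turn the message payload back into an event object.
        static GenericRecord deserialize(byte[] payload) throws Exception {
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
            return new GenericDatumReader<GenericRecord>(SCHEMA).read(null, decoder);
        }

        // Map the event's timestamp to an hourly partition path (assumed layout).
        static String hourlyPartition(String topic, long timestampMillis) {
            SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd/HH");
            fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
            return topic + "/" + fmt.format(new Date(timestampMillis));
        }

        public static void main(String[] args) throws Exception {
            GenericRecord event = new GenericData.Record(SCHEMA);
            event.put("timestamp", System.currentTimeMillis());
            event.put("query", "kafka hadoop consumer");

            byte[] payload = serialize(event);            // what the producer publishes
            GenericRecord decoded = deserialize(payload); // what the Hadoop job reads

            long ts = (Long) decoded.get("timestamp");
            System.out.println(hourlyPartition("search_logs", ts));
            // e.g. search_logs/2011/11/06/18
        }
    }

The key point is that the hour/day bucket comes from the event's own timestamp
at load time, not from the topic name, which is why no special topic layout
like "search_logs/2011/11/06" is needed.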