We use Avro serialization for the message data: Avro schemas convert event objects into Kafka message payloads on the producers, and on the Hadoop side the same schemas deserialize the payloads back into event objects. Each event object has a timestamp field that the Hadoop job uses to place the message into its hourly and daily partitions. So if the Hadoop job runs every 15 mins, it runs 4 times per hour, each run collecting data into the current hour's partition.
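For illustration, here is a rough sketch of that round trip. The schema and field names below are made up for the example (only the timestamp field matters); the Avro calls are the standard org.apache.avro generic-record APIs:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroRoundTrip {
    // Hypothetical event schema; the only thing the pipeline relies on
    // is a timestamp field in every event.
    static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"SearchEvent\",\"fields\":[" +
        "{\"name\":\"timestamp\",\"type\":\"long\"}," +
        "{\"name\":\"query\",\"type\":\"string\"}]}");

    // Producer side: event object -> Avro-encoded Kafka message payload.
    static byte[] serialize(GenericRecord event) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(event, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    // Hadoop side: Kafka message payload -> event object.
    static GenericRecord deserialize(byte[] payload) throws Exception {
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        return new GenericDatumReader<GenericRecord>(SCHEMA).read(null, decoder);
    }

    public static void main(String[] args) throws Exception {
        GenericRecord event = new GenericData.Record(SCHEMA);
        event.put("timestamp", System.currentTimeMillis());
        event.put("query", "kafka hadoop consumer");
        // The Hadoop job reads the timestamp back out of the decoded
        // record to decide which hourly/daily partition the event joins.
        GenericRecord decoded = deserialize(serialize(event));
        System.out.println(decoded.get("timestamp"));
    }
}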
Very soon, the plan is to open-source this Avro-Hadoop pipeline as well.

Thanks,
Neha

On Sun, Nov 6, 2011 at 10:37 AM, Mark <static.void....@gmail.com> wrote:
> "At LinkedIn we use the InputFormat provided in contrib/hadoop-consumer
> to load the data for topics in daily and hourly partitions."
>
> Sorry for my ignorance, but what exactly do you mean by loading the data
> in daily and hourly partitions?
>
> On 11/6/11 10:26 AM, Neha Narkhede wrote:
>>
>> There should be no changes to the way you create topics to achieve
>> this kind of HDFS data load for Kafka. At LinkedIn we use the
>> InputFormat provided in contrib/hadoop-consumer to load the data for
>> topics in daily and hourly partitions. These Hadoop jobs run every 10
>> mins or so. So the maximum delay of data being available from
>> producer->Hadoop is around 10 mins.
>>
>> Thanks,
>> Neha
>>
>> On Sun, Nov 6, 2011 at 8:45 AM, Mark <static.void....@gmail.com> wrote:
>>>
>>> This is more of a general design question, but what is the preferred
>>> way of importing logs from Kafka to HDFS when you want your data
>>> segmented by hour or day? Is there any way to say "Import only this
>>> {hour|day} of logs", or does one need to create their topics around
>>> the way they would like to import them, i.e. Topic:
>>> "search_logs/2011/11/06"? If it's the latter, is there any
>>> documentation/best practices on topic/key design?
>>>
>>> Thanks
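P.S. To make the answer to the original question concrete: the topic name stays flat (e.g. "search_logs"), and the hour/day segmentation lives in the HDFS output path, derived from each event's timestamp on the Hadoop side. The directory layout below is only a hypothetical sketch, not necessarily the exact convention we use:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class PartitionPaths {
    // Hypothetical HDFS layout; the thread doesn't pin down the real
    // directory convention, only that partitions are daily and hourly.
    static String dailyPath(String topic, long timestampMs) {
        SimpleDateFormat day = new SimpleDateFormat("yyyy/MM/dd");
        day.setTimeZone(TimeZone.getTimeZone("UTC"));
        return "/data/" + topic + "/daily/" + day.format(new Date(timestampMs));
    }

    static String hourlyPath(String topic, long timestampMs) {
        SimpleDateFormat hour = new SimpleDateFormat("yyyy/MM/dd/HH");
        hour.setTimeZone(TimeZone.getTimeZone("UTC"));
        return "/data/" + topic + "/hourly/" + hour.format(new Date(timestampMs));
    }

    public static void main(String[] args) {
        long ts = 1320602400000L; // 2011-11-06 18:00 UTC, for example
        System.out.println(dailyPath("search_logs", ts));
        // -> /data/search_logs/daily/2011/11/06
        System.out.println(hourlyPath("search_logs", ts));
        // -> /data/search_logs/hourly/2011/11/06/18
    }
}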