Thanks Felix, I found your blog posts before and they really helped me figure out how to get things working, so I'll definitely give the shell scripts a run.
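The rough plan is one cron entry per topic, something like the sketch below. The script name and argument order are placeholders for whatever the incremental script in the gist actually takes; only the crontab mechanics are meant literally:

# Hypothetical crontab entries: consume each topic incrementally every
# 15 minutes, writing each topic's batches under its own HDFS path.
# incremental-consume.sh and its two arguments are illustrative names,
# not the real interface of the scripts in the gist.
*/15 * * * * /opt/kafka-etl/incremental-consume.sh web_logs      /data/kafka/web_logs      >> /var/log/kafka-etl.log 2>&1
*/15 * * * * /opt/kafka-etl/incremental-consume.sh product_views /data/kafka/product_views >> /var/log/kafka-etl.log 2>&1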
On 24 Jan 2012, at 19:05, Felix GV wrote:

> Hello :)
>
> For question 1:
>
> The hadoop consumer in the contrib directory has almost everything it needs
> to do distributed incremental imports out of the box, but it requires a bit
> of hand-holding.
>
> I've created two scripts to automate the process. One of them generates
> initial offset files, and the other does incremental hadoop consumption.
>
> I personally use a cron job to periodically call the incremental consumer
> script with specific parameters (for topic and HDFS output path).
>
> You can find all of the required files in this gist:
> https://gist.github.com/1671887
>
> The LinkedIn guys promised to eventually release their full Hadoop/Kafka
> ETL code, but I don't think they've had time to get around to it yet. When
> they do release it, it will probably be better than my scripts, but for
> now, I think those scripts are the only publicly available way to do this
> stuff without writing it yourself.
>
> I don't know about questions 2 and 3.
>
> I hope this helps :) !
>
> --
> Felix
>
>
>
> On Tue, Jan 24, 2012 at 3:24 AM, Paul Ingles <p...@forward.co.uk> wrote:
>
>> Hi,
>>
>> I'm investigating using Kafka and would really appreciate getting some
>> more experienced opinions on the way things work together.
>>
>> Our application instances are creating Protocol Buffer serialized
>> messages and pushing them to topics in Kafka:
>>
>> * Web log requests
>> * Product details viewed
>> * Search performed
>> * Email registered
>> etc...
>>
>> I would like to be able to perform incremental loads from these topics
>> into HDFS and then into the rest of the batch processing. I guess I had 3
>> broad questions:
>>
>> 1) How do people trigger the batch loads? Do you just point your
>> SimpleKafkaETLJob input to the previous run's output offset file? Do you
>> move files between runs of the SimpleKafkaETLJob, i.e. move the part-*
>> files into one place and move the offsets into an input directory ready
>> for the next run?
>>
>> 2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper
>> outputs Long/Text writables and is marked as deprecated (this is in the
>> 0.7 source). Is there an alternative class that should be used instead,
>> or is the hadoop-consumer being deprecated overall?
>>
>> 3) Given that SimpleKafkaETLMapper reads bytes in but outputs Text lines,
>> are most people using Kafka for passing text messages around, or using
>> JSON data etc.?
>>
>> Thanks,
>> Paul
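For completeness, the run-to-run handoff I was describing in question 1 would look roughly like this. The run-class.sh entry point and SimpleKafkaETLJob are from the 0.7 contrib hadoop-consumer; the HDFS layout, the offsets* file-name pattern and the etl.properties contents are my own assumptions, so treat this as a sketch rather than the contrib job's actual conventions:

#!/bin/sh
# Sketch of one incremental run, under assumed conventions:
#   $BASE/input - offset files read by this run (seeded initially by
#                 the offset-generating script)
#   $BASE/run   - this run's job output (part-* data plus new offsets)
#   $BASE/data  - accumulated data files, one subdirectory per run
TOPIC=web_logs
BASE=/data/kafka/$TOPIC

# Run the contrib ETL job; etl.properties would point input= at
# $BASE/input and output= at $BASE/run (property names as in the 0.7
# contrib README, worth double-checking against your checkout).
./run-class.sh kafka.etl.impl.SimpleKafkaETLJob etl.properties

# Park the data files in a per-run directory so successive runs
# can't clobber each other's part-* names.
RUN_ID=$(date +%Y%m%d%H%M%S)
hadoop fs -mkdir "$BASE/data/$RUN_ID"
hadoop fs -mv "$BASE/run/part-*" "$BASE/data/$RUN_ID/"

# Promote the freshly written offsets to be the next run's input.
# "offsets*" is a guess at the naming; match whatever the job writes.
hadoop fs -rm "$BASE/input/*"
hadoop fs -mv "$BASE/run/offsets*" "$BASE/input/"

# Clear the output directory so the next run can recreate it.
hadoop fs -rmr "$BASE/run"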