Hi, I'm investigating using Kafka and would really appreciate some more experienced opinions on how these pieces fit together.
Our application instances are creating Protocol Buffer-serialized messages and pushing them to topics in Kafka:

* Web log requests
* Product details viewed
* Search performed
* Email registered, etc.

I would like to be able to perform incremental loads from these topics into HDFS and then into the rest of our batch processing. I guess I have three broad questions:

1) How do people trigger the batch loads? Do you just point your SimpleKafkaETLJob input at the previous run's output offset file? Or do you move files between runs of the SimpleKafkaETLJob, i.e. move the part-* files into one place and move the offsets into an input directory ready for the next run? (I've put a rough sketch of what I mean below.)

2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper outputs Long/Text writables and is marked as deprecated (this is in the 0.7 source). Is there an alternative class that should be used instead, or is the hadoop-consumer being deprecated overall?

3) Given that SimpleKafkaETLMapper reads bytes in but outputs Text lines, are most people using Kafka for passing plain text messages around, or JSON data, etc.? (There's a second sketch below of the kind of mapper I'm picturing for our Protocol Buffer payloads.)

Thanks,
Paul
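
P.S. To make question 1 concrete, here is roughly the kind of housekeeping I imagine running between jobs. The directory layout and the "offset-*" filename pattern are just my guesses, not anything taken from the hadoop-consumer source:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Shuffles the previous run's output so the next SimpleKafkaETLJob run can
// pick up where it left off.  The paths and the "offset-*" pattern are made up.
public class RotateKafkaEtlDirs {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path jobOutput = new Path("/kafka-etl/output");   // where the last run wrote its results
        Path archive   = new Path("/kafka-etl/archive");  // where downstream batch jobs read from
        Path jobInput  = new Path("/kafka-etl/input");    // what the next run will be pointed at

        fs.mkdirs(archive);
        fs.mkdirs(jobInput);

        // Move the data files (part-*) out of the way for the batch processing.
        moveMatching(fs, new Path(jobOutput, "part-*"), archive);

        // Move the offset files into the input directory so the next run
        // starts from where this one finished.
        moveMatching(fs, new Path(jobOutput, "offset-*"), jobInput);
    }

    private static void moveMatching(FileSystem fs, Path pattern, Path destDir)
            throws IOException {
        FileStatus[] matches = fs.globStatus(pattern);
        if (matches == null) {
            return;  // nothing matched the pattern
        }
        for (FileStatus stat : matches) {
            fs.rename(stat.getPath(), new Path(destDir, stat.getPath().getName()));
        }
    }
}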
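
And for question 3, this is the sort of custom mapper I'm picturing for our Protocol Buffer payloads. ProductView stands in for one of our generated protobuf classes, and the input key/value types are assumptions rather than what the hadoop-consumer actually hands a mapper:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// ProductView is a placeholder for one of our generated Protocol Buffer classes;
// the input key/value types here are guesses, not the hadoop-consumer's real ones.
public class ProductViewProtobufMapper
        extends Mapper<LongWritable, BytesWritable, LongWritable, Text> {

    @Override
    protected void map(LongWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // BytesWritable's backing array is padded, so trim it to the real
        // length before handing it to the protobuf parser.
        byte[] payload = Arrays.copyOf(value.getBytes(), value.getLength());

        ProductView view = ProductView.parseFrom(payload);

        // Re-emit a tab-separated Text line, which is roughly the shape of
        // output that SimpleKafkaETLMapper produces today.
        context.write(key, new Text(view.getProductId() + "\t" + view.getUserId()));
    }
}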