Hello :)

For question 1:

The Hadoop consumer in the contrib directory has almost everything it needs
to do distributed incremental imports out of the box, but it requires a bit
of hand-holding. I've written two scripts to automate the process: one
generates the initial offset files, and the other performs an incremental
Hadoop consumption run. I personally use a cron job to call the incremental
consumer script periodically with the appropriate parameters (topic and HDFS
output path); there's a rough sketch of what that looks like at the bottom
of this mail. You can find all of the required files in this gist:
https://gist.github.com/1671887

The LinkedIn guys promised to eventually release their full Hadoop/Kafka ETL
code, but I don't think they've gotten around to it yet. When they do, it
will probably be better than my scripts, but for now I believe those scripts
are the only publicly available way to do this without writing it yourself.

I don't know about questions 2 and 3.

I hope this helps :)

--
Felix

On Tue, Jan 24, 2012 at 3:24 AM, Paul Ingles <p...@forward.co.uk> wrote:
> Hi,
>
> I'm investigating using Kafka and would really appreciate some more
> experienced opinions on the way things work together.
>
> Our application instances create Protocol Buffer serialized messages
> and push them to topics in Kafka:
>
> * Web log requests
> * Product details viewed
> * Search performed
> * Email registered
> etc...
>
> I would like to be able to perform incremental loads from these topics
> into HDFS and then into the rest of the batch processing. I guess I have
> 3 broad questions:
>
> 1) How do people trigger the batch loads? Do you just point your
> SimpleKafkaETLJob input at the previous run's output offset file? Do you
> move files between runs of the SimpleKafkaETLJob, i.e. move the part-*
> files into one place and the offsets into an input directory ready for
> the next run?
>
> 2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper
> outputs Long/Text writables and is marked as deprecated (this is in the
> 0.7 source). Is there an alternative class that should be used instead,
> or is the hadoop-consumer being deprecated overall?
>
> 3) Given that SimpleKafkaETLMapper reads bytes in but outputs Text lines,
> are most people using Kafka for passing text messages around, or JSON
> data, etc.?
>
> Thanks,
> Paul
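
P.S. Here's the rough sketch of the cron setup I mentioned for question 1.
The script names, install paths, topic, and log locations below are all
hypothetical (the real scripts are in the gist); this is only meant to show
the shape of the thing: one offset-initialization run by hand, then an
hourly cron entry that passes the topic and the HDFS output path to the
incremental consumer script.

    # One-off, by hand: generate the initial offset files for the topic.
    # (Hypothetical script name and paths -- see the gist for the real ones.)
    /opt/kafka-etl/generate-initial-offsets.sh web-logs hdfs:///data/kafka/web-logs

    # crontab entry: run the incremental consumer at the top of every hour,
    # appending stdout/stderr to a log so failed runs are easy to spot.
    0 * * * * /opt/kafka-etl/incremental-consumer.sh web-logs hdfs:///data/kafka/web-logs >> /var/log/kafka-etl/web-logs.log 2>&1

Under the hood the scripts just drive the contrib hadoop-consumer classes
(if memory serves, kafka.etl.impl.DataGenerator for the initial offsets and
kafka.etl.impl.SimpleKafkaETLJob for the incremental pull) with a generated
properties file, so if you'd rather wire it up yourself, those two classes
are the place to start.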
The hadoop consumer in the contrib directory has almost everything it needs to do distributed incremental imports out of the box, but it requires a bit of hand holding. I've created two scripts to automate the process. One of them generates initial offset files, and the other does incremental hadoop consumption. I personally use a cron job to periodically call the incremental consumer script with specific parameters (for topic and HDFS path output). You can find all of the required files in this gist: https://gist.github.com/1671887 The LinkedIn guys promised to release their full Hadoop/Kafka ETL code eventually but I think they didn't have time to get around to it yet. When they do release it, it's probably going to be better than my scripts, but for now, I think those scripts are the only publically available way to do this stuff without writing it yourself. I don't know about question 2 and 3. I hope this helps :) ! -- Felix On Tue, Jan 24, 2012 at 3:24 AM, Paul Ingles <p...@forward.co.uk> wrote: > Hi, > > I'm investigating using Kafka and would really appreciate getting some > more experienced opinion on the way things work together. > > Our application instances are creating Protocol Buffer serialized messages > and pushing them to topics in Kafka: > > * Web log requests > * Product details viewed > * Search performed > * Email registered > etc... > > I would like to be able to perform incremental loads from these topics > into HDFS and then into the rest of the batch processing. I guess I had 3 > broad questions > > 1) How do people trigger the batch loads? Do you just point your > SimpleKafkaETLJob input to the previous runs outputted offset file? Do you > move files between runs of the SimpleKafkaETLJob- move the part-* file into > one place and move the offsets into an input directory ready for the next > run? > > 2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper > outputs Long/Text writables and is marked as deprecated (this is in the 0.7 > source). Is there an alternative class that should be used instead, or is > the hadoop-consumer being deprecated overall? > > 3) Given the SimpleKafkaETLMapper reads bytes in but outputs Text lines, > are most people using Kafka for passing text messages around or using JSON > data etc.? > > Thanks, > Paul