Thanks Felix, I found your blog posts before and they really helped me figure out how to get things working, so I'll definitely give the shell scripts a run.
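The rough plan is one cron entry per topic, something like the sketch below. The script name and argument order are placeholders for whatever the incremental script in the gist actually takes; only the crontab mechanics are meant literally:

# Hypothetical crontab entries: consume each topic incrementally every
# 15 minutes, writing each topic's batches under its own HDFS path.
# incremental-consume.sh and its two arguments are illustrative names,
# not the real interface of the scripts in the gist.
*/15 * * * * /opt/kafka-etl/incremental-consume.sh web_logs      /data/kafka/web_logs      >> /var/log/kafka-etl.log 2>&1
*/15 * * * * /opt/kafka-etl/incremental-consume.sh product_views /data/kafka/product_views >> /var/log/kafka-etl.log 2>&1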
On 24 Jan 2012, at 19:05, Felix GV wrote:

> Hello :)
>
> For question 1:
>
> The hadoop consumer in the contrib directory has almost everything it needs
> to do distributed incremental imports out of the box, but it requires a bit
> of hand-holding.
>
> I've created two scripts to automate the process. One of them generates
> initial offset files, and the other does incremental hadoop consumption.
>
> I personally use a cron job to periodically call the incremental consumer
> script with specific parameters (for topic and HDFS output path).
>
> You can find all of the required files in this gist:
> https://gist.github.com/1671887
>
> The LinkedIn guys promised to eventually release their full Hadoop/Kafka
> ETL code, but I don't think they've had time to get around to it yet. When
> they do release it, it will probably be better than my scripts, but for
> now, I think those scripts are the only publicly available way to do this
> stuff without writing it yourself.
>
> I don't know about questions 2 and 3.
>
> I hope this helps :) !
>
> --
> Felix
>
>
>
> On Tue, Jan 24, 2012 at 3:24 AM, Paul Ingles <p...@forward.co.uk> wrote:
>
>> Hi,
>>
>> I'm investigating using Kafka and would really appreciate getting some
>> more experienced opinions on the way things work together.
>>
>> Our application instances are creating Protocol Buffer serialized
>> messages and pushing them to topics in Kafka:
>>
>> * Web log requests
>> * Product details viewed
>> * Search performed
>> * Email registered
>> etc...
>>
>> I would like to be able to perform incremental loads from these topics
>> into HDFS and then into the rest of the batch processing. I guess I had 3
>> broad questions:
>>
>> 1) How do people trigger the batch loads? Do you just point your
>> SimpleKafkaETLJob input to the previous run's output offset file? Do you
>> move files between runs of the SimpleKafkaETLJob, i.e. move the part-*
>> files into one place and move the offsets into an input directory ready
>> for the next run?
>>
>> 2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper
>> outputs Long/Text writables and is marked as deprecated (this is in the
>> 0.7 source). Is there an alternative class that should be used instead,
>> or is the hadoop-consumer being deprecated overall?
>>
>> 3) Given that SimpleKafkaETLMapper reads bytes in but outputs Text lines,
>> are most people using Kafka for passing text messages around, or using
>> JSON data etc.?
>>
>> Thanks,
>> Paul
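For completeness, the run-to-run handoff I was describing in question 1 would look roughly like this. The run-class.sh entry point and SimpleKafkaETLJob are from the 0.7 contrib hadoop-consumer; the HDFS layout, the offsets* file-name pattern and the etl.properties contents are my own assumptions, so treat this as a sketch rather than the contrib job's actual conventions:

#!/bin/sh
# Sketch of one incremental run, under assumed conventions:
#   $BASE/input - offset files read by this run (seeded initially by
#                 the offset-generating script)
#   $BASE/run   - this run's job output (part-* data plus new offsets)
#   $BASE/data  - accumulated data files, one subdirectory per run
TOPIC=web_logs
BASE=/data/kafka/$TOPIC

# Run the contrib ETL job; etl.properties would point input= at
# $BASE/input and output= at $BASE/run (property names as in the 0.7
# contrib README, worth double-checking against your checkout).
./run-class.sh kafka.etl.impl.SimpleKafkaETLJob etl.properties

# Park the data files in a per-run directory so successive runs
# can't clobber each other's part-* names.
RUN_ID=$(date +%Y%m%d%H%M%S)
hadoop fs -mkdir "$BASE/data/$RUN_ID"
hadoop fs -mv "$BASE/run/part-*" "$BASE/data/$RUN_ID/"

# Promote the freshly written offsets to be the next run's input.
# "offsets*" is a guess at the naming; match whatever the job writes.
hadoop fs -rm "$BASE/input/*"
hadoop fs -mv "$BASE/run/offsets*" "$BASE/input/"

# Clear the output directory so the next run can recreate it.
hadoop fs -rmr "$BASE/run"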