Hi, I'm investigating using Kafka and would really appreciate some more experienced opinions on how these pieces fit together.
Our application instances are creating Protocol Buffer-serialized messages and pushing them to topics in Kafka:

* Web log requests
* Product details viewed
* Search performed
* Email registered, etc.

I would like to be able to perform incremental loads from these topics into HDFS and then into the rest of our batch processing. I guess I have three broad questions:

1) How do people trigger the batch loads? Do you just point your SimpleKafkaETLJob input at the previous run's output offset file? Or do you move files between runs of the SimpleKafkaETLJob, i.e. move the part-* files into one place and move the offsets into an input directory ready for the next run? (I've put a rough sketch of what I mean below.)

2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper outputs Long/Text writables and is marked as deprecated (this is in the 0.7 source). Is there an alternative class that should be used instead, or is the hadoop-consumer being deprecated overall?

3) Given that SimpleKafkaETLMapper reads bytes in but outputs Text lines, are most people using Kafka for passing plain text messages around, or JSON data, etc.? (There's a second sketch below of the kind of mapper I'm picturing for our Protocol Buffer payloads.)

Thanks,
Paul
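
P.S. To make question 1 concrete, here is roughly the kind of housekeeping I imagine running between jobs. The directory layout and the "offset-*" filename pattern are just my guesses, not anything taken from the hadoop-consumer source:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Shuffles the previous run's output so the next SimpleKafkaETLJob run can
// pick up where it left off.  The paths and the "offset-*" pattern are made up.
public class RotateKafkaEtlDirs {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path jobOutput = new Path("/kafka-etl/output");   // where the last run wrote its results
        Path archive   = new Path("/kafka-etl/archive");  // where downstream batch jobs read from
        Path jobInput  = new Path("/kafka-etl/input");    // what the next run will be pointed at

        fs.mkdirs(archive);
        fs.mkdirs(jobInput);

        // Move the data files (part-*) out of the way for the batch processing.
        moveMatching(fs, new Path(jobOutput, "part-*"), archive);

        // Move the offset files into the input directory so the next run
        // starts from where this one finished.
        moveMatching(fs, new Path(jobOutput, "offset-*"), jobInput);
    }

    private static void moveMatching(FileSystem fs, Path pattern, Path destDir)
            throws IOException {
        FileStatus[] matches = fs.globStatus(pattern);
        if (matches == null) {
            return;  // nothing matched the pattern
        }
        for (FileStatus stat : matches) {
            fs.rename(stat.getPath(), new Path(destDir, stat.getPath().getName()));
        }
    }
}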
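
And for question 3, this is the sort of custom mapper I'm picturing for our Protocol Buffer payloads. ProductView stands in for one of our generated protobuf classes, and the input key/value types are assumptions rather than what the hadoop-consumer actually hands a mapper:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// ProductView is a placeholder for one of our generated Protocol Buffer classes;
// the input key/value types here are guesses, not the hadoop-consumer's real ones.
public class ProductViewProtobufMapper
        extends Mapper<LongWritable, BytesWritable, LongWritable, Text> {

    @Override
    protected void map(LongWritable key, BytesWritable value, Context context)
            throws IOException, InterruptedException {
        // BytesWritable's backing array is padded, so trim it to the real
        // length before handing it to the protobuf parser.
        byte[] payload = Arrays.copyOf(value.getBytes(), value.getLength());

        ProductView view = ProductView.parseFrom(payload);

        // Re-emit a tab-separated Text line, which is roughly the shape of
        // output that SimpleKafkaETLMapper produces today.
        context.write(key, new Text(view.getProductId() + "\t" + view.getUserId()));
    }
}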