Hi, I'm evaluating Kafka for aggregating web logs and other activity-tracking data for one of our projects, and I'd like to know a little more about the best way to stitch things together.
The application runs across EC2 and some internal hardware, and we also run a Hadoop cluster inside our office. I'd like to use Kafka to aggregate the activity from all of these hosts, feed it into something like Esper for some systems-monitoring work, and pull the data down to our Hadoop cluster (and ultimately into some Hive tables) for offline analysis.

I notice in the hadoop-consumer README (https://github.com/kafka-dev/kafka/tree/master/contrib/hadoop-consumer) that it's necessary to provide the HDFS location of the input files. Does anyone have recommendations on good ways to pull data onto HDFS? My current thinking is to replicate the topic offsets onto S3 periodically and then run distcp periodically to copy them onto HDFS.

Thanks for any tips,
Paul
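P.S. In case it helps make the S3 leg concrete, below is the rough shape I had in mind. It's only a sketch: it assumes we mirror the raw message payloads themselves onto S3 (the Kafka consumption is stubbed out behind a hypothetical fetchNextBatch() helper, and the bucket/path names are made up), writing each batch out through Hadoop's s3n FileSystem.

    import java.net.URI;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch only: batches of raw Kafka message payloads are written as files
    // under an S3 bucket via Hadoop's s3n FileSystem, so a periodic distcp can
    // later land them on HDFS for the Hive tables.
    public class ActivityLogUploader {

        // Writes one batch of messages as a single newline-delimited object on S3.
        static void writeBatch(List<byte[]> messages, String dest) throws Exception {
            Configuration conf = new Configuration();
            // Assumes fs.s3n.awsAccessKeyId / fs.s3n.awsSecretAccessKey are configured.
            FileSystem fs = FileSystem.get(URI.create(dest), conf);
            FSDataOutputStream out = fs.create(new Path(dest));
            try {
                for (byte[] message : messages) {
                    out.write(message);
                    out.write('\n');
                }
            } finally {
                out.close();
            }
        }

        public static void main(String[] args) throws Exception {
            while (true) {
                // fetchNextBatch() is a stand-in for whatever Kafka consumer we
                // end up using; not implemented here.
                List<byte[]> batch = fetchNextBatch();
                String dest = "s3n://my-log-bucket/activity/" + System.currentTimeMillis();
                writeBatch(batch, dest);
            }
        }

        // Hypothetical stub for the Kafka side of the pipeline.
        private static List<byte[]> fetchNextBatch() {
            throw new UnsupportedOperationException("stub for the Kafka consumer");
        }
    }

The Hadoop-side leg would then just be a cron job running something along the lines of "hadoop distcp s3n://my-log-bucket/activity hdfs:///logs/activity" (paths made up) to land the files where the Hive tables can pick them up. Does that seem reasonable, or is there a more direct route people use?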