Hi,

I'm evaluating Kafka for aggregating web logs and some additional activity 
tracking for one of our projects, and I'd like to know a little more about the 
best way to stitch things together.

The application runs across EC2 and some internal hardware, and we also run a 
Hadoop cluster inside our office. I'd like to use Kafka to aggregate the 
activity data, feed it into something like Esper for some systems monitoring, 
and pull it down to our Hadoop cluster (and ultimately into some Hive tables) 
for offline analysis.
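
To make the monitoring piece a bit more concrete, here's a rough sketch of what 
I'm imagining on the Esper side, assuming each message off the topic decodes 
into a simple POJO (the WebLogEvent class and the alert rule below are just 
placeholders, and the Kafka consumer itself isn't shown):

import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;
import com.espertech.esper.client.UpdateListener;

public class LogMonitor {

    // Placeholder event class: one decoded log line off the Kafka topic.
    public static class WebLogEvent {
        private final String path;
        private final int statusCode;

        public WebLogEvent(String path, int statusCode) {
            this.path = path;
            this.statusCode = statusCode;
        }

        public String getPath() { return path; }
        public int getStatusCode() { return statusCode; }
    }

    private final EPServiceProvider epService;

    public LogMonitor() {
        Configuration config = new Configuration();
        config.addEventType("WebLogEvent", WebLogEvent.class);
        epService = EPServiceProviderManager.getDefaultProvider(config);

        // Placeholder rule: alert when a path serves more than 50 5xx
        // responses within a sliding 5-minute window.
        EPStatement stmt = epService.getEPAdministrator().createEPL(
            "select path, count(*) as errors "
            + "from WebLogEvent(statusCode >= 500).win:time(5 minutes) "
            + "group by path having count(*) > 50");

        stmt.addListener(new UpdateListener() {
            public void update(EventBean[] newEvents, EventBean[] oldEvents) {
                if (newEvents == null) {
                    return;
                }
                for (EventBean event : newEvents) {
                    System.out.println("ALERT: " + event.get("path") + " returned "
                        + event.get("errors") + " errors in the last 5 minutes");
                }
            }
        });
    }

    // The Kafka consumer thread calls this for every decoded message.
    public void onLogLine(WebLogEvent event) {
        epService.getEPRuntime().sendEvent(event);
    }
}

The consumer threads would just call onLogLine() for every message they pull 
off the topic.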

I notice in the hadoop-consumer README 
(https://github.com/kafka-dev/kafka/tree/master/contrib/hadoop-consumer) that 
it's necessary to provide the HDFS location of the input files.

I was wondering whether people have recommendations on good ways to pull data 
onto HDFS? My current thinking is to periodically dump each topic (keeping 
track of the consumed offsets) onto S3, and then run distcp to copy the 
resulting files onto HDFS.
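
For illustration, the copy step I have in mind would look something like the 
sketch below using the Hadoop FileSystem API (the bucket, credentials and 
paths are placeholders); at any real volume I assume distcp run from cron 
would do the same job in parallel:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class S3ToHdfsCopy {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; the real ones would come from the Hadoop config.
        conf.set("fs.s3n.awsAccessKeyId", "ACCESS_KEY");
        conf.set("fs.s3n.awsSecretAccessKey", "SECRET_KEY");

        FileSystem s3 = FileSystem.get(URI.create("s3n://my-log-bucket/"), conf);
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

        // Placeholder paths: one "directory" of topic dumps per day.
        Path src = new Path("s3n://my-log-bucket/kafka/web-logs/2011-07-20/");
        Path dst = new Path("/data/kafka/web-logs/2011-07-20/");
        hdfs.mkdirs(dst);

        // Copy each object for the day into HDFS, leaving the S3 copy in place.
        for (FileStatus status : s3.listStatus(src)) {
            FileUtil.copy(s3, status.getPath(),
                          hdfs, new Path(dst, status.getPath().getName()),
                          false, conf);
        }
    }
}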

Thanks for any tips,
Paul
