>> - We have event data under the topic "foo" written to the Kafka
>> server/broker in Avro format, and want to write those events to HDFS.
>> Does the Hadoop consumer expect the data written to HDFS already?
>
> No, it doesn't expect the data to be written into HDFS already... There
> wouldn't be much point to it otherwise, no? ;)
Sorry, my note was unclear. I meant that the SimpleKafkaETLJob requires a
sequence file with an offset written to HDFS, which it then uses as a
bookmark to pull the data from the broker? This file has a checksum, and I
was trying to modify the topic in it, which of course messes up the
checksum. I already have events generated on my Kafka server, and all I
wanted to do is run the SimpleKafkaETLJob to pull out the data and write it
to HDFS. I was trying to fulfill the sequence file prerequisite, and that
does not seem to work for me. (A sketch of generating that file
programmatically, instead of editing it, is at the end of this message.)

>> Based on the doc, it looks like the DataGenerator is pulling events
>> from the broker and writing to HDFS. In our case we only wanted to
>> utilize the SimpleKafkaETLJob to write to HDFS.
>
> That's what it does. It spawns a (map-only) MapReduce job that pulls in
> parallel from the broker(s) and writes that data into HDFS.
>
>> I am surely missing something here?
>
> Maybe...? I don't know. Do tell if anything is not clear still...!

Thanks for confirming, just want to make sure I got it right.

>> - Is there a version of the consumer which appends to an existing file
>> on HDFS until it reaches a specific size?
>
> No there isn't, as far as I know. Potential solutions to this would be:
>
>    1. Leave the data in the broker long enough for it to reach the size
>    you want. Running the SimpleKafkaETLJob at those intervals would give
>    you the file size you want. This is the simplest thing to do, but the
>    drawback is that your data in HDFS will be less real-time.
>    2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
>    up / compact your small files into one bigger file. You would need to
>    come up with the Hadoop job that does the roll-up, or find one
>    somewhere.
>    3. Don't use the SimpleKafkaETLJob at all, and write a new job that
>    makes use of Hadoop append instead...

These options are very useful. I like option 3 the most :) (Rough sketches
for options 2 and 3 are at the end of this message too.)

> Also, you may be interested to take a look at these scripts
> <http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
> I posted a while ago. If you follow the links in this post, you can get
> more details about how the scripts work and why it was necessary to do
> the things they do... or you can just use them without reading. They
> should work pretty much out of the box...

Will surely give them a spin. Thanks!

>> Thanks,
>> murtaza
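
P.S. Here are the sketches I mentioned above, in case they help anyone else.

On the checksum problem: a SequenceFile can't be edited in place (it carries
sync markers and checksums), so the offset file has to be rewritten through
SequenceFile.Writer rather than patched by hand. Below is a minimal,
untested sketch of what I mean. The key/value classes, paths, and the
request record format here are placeholders (my assumptions, not the real
contract); the DataGenerator source in contrib/hadoop-consumer shows the
exact classes and record layout that SimpleKafkaETLJob expects.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.SequenceFile;

    public class OffsetFileGenerator {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical namenode URI and input path -- adjust to your cluster.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
            Path input = new Path("/tmp/kafka/etl/input/offsets.dat");

            // Hypothetical request record: broker, topic, partition, start
            // offset (-1 often meaning "earliest available"). Check the
            // DataGenerator source for the real format.
            String request = "tcp://broker-host:9092\tfoo\t0\t-1";

            // Writing a fresh file keeps the sync markers and checksums
            // consistent, which hand-editing cannot do.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, input, NullWritable.class, BytesWritable.class);
            try {
                writer.append(NullWritable.get(),
                        new BytesWritable(request.getBytes("UTF-8")));
            } finally {
                writer.close();
            }
        }
    }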
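
For option 2, before writing a full MapReduce roll-up job, Hadoop's
FileUtil.copyMerge may be enough for modest volumes: it concatenates every
file under a directory into one file. The caveat is that it streams the
bytes through the client process rather than running as a distributed job,
so treat this as a sketch (paths are my own assumptions), not a scalable
solution.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class RollUp {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical paths: many small ETL output files in, one file out.
            Path smallFiles = new Path("/tmp/kafka/etl/output");
            Path merged = new Path("/data/foo/merged-" + System.currentTimeMillis());

            // Concatenates all files under smallFiles into "merged", deleting
            // the sources afterwards (the "true" flag). Not a distributed job:
            // all bytes pass through this client.
            FileUtil.copyMerge(fs, smallFiles, fs, merged, true, conf, null);
        }
    }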
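
And for option 3, the core of an append-based writer might look like the
following. Again only a sketch: the roll size, directory, and part-file
naming scheme are my own assumptions, and FileSystem.append() only works on
clusters where append is enabled (dfs.support.append on 0.20-era HDFS).

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Appends events to the current file, rolling once it hits a target size. */
    public class AppendingSink {
        private static final long ROLL_SIZE = 128L * 1024 * 1024; // assumption: 128 MB

        private final FileSystem fs;
        private final Path dir;
        private int part = 0;

        public AppendingSink(FileSystem fs, Path dir) {
            this.fs = fs;
            this.dir = dir;
        }

        public void write(byte[] event) throws IOException {
            Path current = new Path(dir, String.format("part-%05d", part));
            FSDataOutputStream out;
            if (!fs.exists(current)) {
                // First write to this part file.
                out = fs.create(current);
            } else if (fs.getFileStatus(current).getLen() < ROLL_SIZE) {
                // Still under the target size: keep appending.
                out = fs.append(current);
            } else {
                // Current file reached the target size: roll to a new one.
                part++;
                out = fs.create(new Path(dir, String.format("part-%05d", part)));
            }
            try {
                out.write(event);
            } finally {
                out.close();
            }
        }
    }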