Hello Min,

Is the ConsumerConfig class missing from your GitHub project's source code? I was trying to download and play with the source code.
Thanks,
murtaza

On 7/3/12 6:29 PM, "Min" <mini...@gmail.com> wrote:

>I've created another hadoop consumer which uses zookeeper.
>
>https://github.com/miniway/kafka-hadoop-consumer
>
>With a hadoop OutputFormatter, I could add new files to the existing
>target directory.
>Hope this helps.
>
>Thanks
>Min
>
>2012/7/4 Murtaza Doctor <murt...@richrelevance.com>:
>> +1 This surely sounds interesting.
>>
>> On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>>
>>>Hmm, that's surprising. I didn't know about that!
>>>
>>>I wonder if it's a new feature. Judging from your email, I assume
>>>you're using CDH? What version?
>>>
>>>Interesting :) ...
>>>
>>>--
>>>Felix
>>>
>>>On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
>>>casey.sybra...@six3systems.com> wrote:
>>>
>>>> >> - Is there a version of the consumer which appends to an existing
>>>> >> file on HDFS until it reaches a specific size?
>>>> >
>>>> >No, there isn't, as far as I know. Potential solutions would be:
>>>> >
>>>> > 1. Leave the data in the broker long enough for it to reach the size
>>>> > you want. Running the SimpleKafkaETLJob at those intervals would
>>>> > give you the file size you want. This is the simplest thing to do,
>>>> > but the drawback is that your data in HDFS will be less real-time.
>>>> > 2. Run the SimpleKafkaETLJob as frequently as you want, and then
>>>> > roll up / compact your small files into one bigger file. You would
>>>> > need to come up with the hadoop job that does the roll-up, or find
>>>> > one somewhere.
>>>> > 3. Don't use the SimpleKafkaETLJob at all and write a new job that
>>>> > makes use of hadoop append instead.
>>>> >
>>>> >Also, you may be interested to take a look at these scripts
>>>> ><http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
>>>> >that I posted a while ago. If you follow the links in this post, you
>>>> >can get more details about how the scripts work and why it was
>>>> >necessary to do the things they do... or you can just use them
>>>> >without reading. They should work pretty much out of the box.
>>>>
>>>> Where I work, we discovered that you can keep a file in HDFS open and
>>>> still run MapReduce jobs against the data in that file. What you do is
>>>> flush the data periodically (every record, in our case) but not close
>>>> the file right away. This lets us have data files that contain 24
>>>> hours' worth of data without having to close the file to run the jobs,
>>>> or schedule the jobs for after the file is closed. You can also check
>>>> the file size periodically and rotate the files based on size. We use
>>>> Avro files, but sequence files should work too, according to Cloudera.
>>>>
>>>> It's a great compromise for when you want the latest and greatest data
>>>> but don't want to wait until all of the files are closed to get it.
>>>>
>>>> Casey
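
For reference, here is a minimal sketch of the "flush but don't close" pattern Casey describes, assuming the plain Hadoop FileSystem API. The class name, file naming scheme, and 512 MB threshold below are made up for illustration, and hflush() is the Hadoop 0.21+/2.x name for what older releases exposed as sync(), so adjust for your version:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /** Hypothetical writer: keeps an HDFS file open, flushes every record,
     *  and rotates to a new file once a size threshold is crossed. */
    public class RollingHdfsWriter {

        private static final long MAX_FILE_BYTES = 512L * 1024 * 1024; // arbitrary rotation size

        private final FileSystem fs;
        private final Path dir;
        private FSDataOutputStream out;
        private int fileIndex = 0;

        public RollingHdfsWriter(Configuration conf, Path dir) throws IOException {
            this.fs = FileSystem.get(conf);
            this.dir = dir;
            this.out = openNextFile();
        }

        /** Write one record and flush it so readers (e.g. a MapReduce job) can see it. */
        public synchronized void append(byte[] record) throws IOException {
            out.write(record);
            out.write('\n');
            out.hflush();                        // data becomes visible without closing the file
            if (out.getPos() >= MAX_FILE_BYTES) {
                rotate();                        // size-based rotation, as Casey mentions
            }
        }

        private void rotate() throws IOException {
            out.close();
            out = openNextFile();
        }

        private FSDataOutputStream openNextFile() throws IOException {
            Path file = new Path(dir, String.format("events-%05d.log", fileIndex++));
            return fs.create(file);
        }

        public synchronized void close() throws IOException {
            out.close();
        }
    }

Because the stream stays open, jobs reading the file see everything written up to the last hflush(); closing only happens when getPos() crosses the threshold, which gives the size-based rotation without sacrificing freshness.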
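
And for option 2 in Felix's list (rolling the many small files produced by frequent SimpleKafkaETLJob runs into one bigger file), a simple non-MapReduce approach is Hadoop's FileUtil.copyMerge, which just concatenates every file in a directory. A rough sketch with made-up paths; note this only suits plain record-per-line output, since Avro or sequence files would need a proper merge job:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class CompactSmallFiles {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path smallFilesDir = new Path("/kafka/etl/2012-07-03");        // hypothetical input dir
            Path mergedFile    = new Path("/kafka/merged/2012-07-03.log"); // hypothetical output file

            // Concatenate every file under smallFilesDir into a single file,
            // deleting the small source files afterwards.
            FileUtil.copyMerge(fs, smallFilesDir, fs, mergedFile,
                               true /* deleteSource */, conf, null /* no separator string */);
        }
    }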