I've created another Hadoop consumer which uses ZooKeeper: https://github.com/miniway/kafka-hadoop-consumer
With a hadoop OutputFormat, I could add new files to the existing target
directory (a rough sketch is at the bottom of this mail). Hope this helps.

Thanks,
Min

2012/7/4 Murtaza Doctor <murt...@richrelevance.com>:
> +1 This surely sounds interesting.
>
> On 7/3/12 10:05 AM, "Felix GV" <fe...@mate1inc.com> wrote:
>
>> Hmm, that's surprising. I didn't know about that...!
>>
>> I wonder if it's a new feature... Judging from your email, I assume
>> you're using CDH? What version?
>>
>> Interesting :) ...
>>
>> --
>> Felix
>>
>> On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey
>> <casey.sybra...@six3systems.com> wrote:
>>
>>> >> - Is there a version of consumer which appends to an existing file
>>> >> on HDFS until it reaches a specific size?
>>> >
>>> > No, there isn't, as far as I know. Potential solutions to this would be:
>>> >
>>> > 1. Leave the data in the broker long enough for it to reach the size
>>> > you want. Running the SimpleKafkaETLJob at those intervals would give
>>> > you the file size you want. This is the simplest thing to do, but the
>>> > drawback is that your data in HDFS will be less real-time.
>>> > 2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
>>> > up / compact your small files into one bigger file. You would need to
>>> > come up with the hadoop job that does the roll-up, or find one
>>> > somewhere.
>>> > 3. Don't use the SimpleKafkaETLJob at all and write a new job that
>>> > makes use of hadoop append instead...
>>> >
>>> > Also, you may be interested to take a look at these scripts
>>> > <http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
>>> > I posted a while ago. If you follow the links in this post, you can
>>> > get more details about how the scripts work and why it was necessary
>>> > to do the things they do... or you can just use them without reading.
>>> > They should work pretty much out of the box...
>>>
>>> Where I work, we discovered that you can keep a file in HDFS open and
>>> still run MapReduce jobs against the data in that file. What you do is
>>> flush the data periodically (every record, in our case), but don't
>>> close the file right away. This allows us to have data files that
>>> contain 24 hours' worth of data without having to close the file to run
>>> the jobs, or schedule the jobs for after the file is closed. You can
>>> also check the file size periodically and rotate the files based on
>>> size. We use Avro files, but sequence files should work too, according
>>> to Cloudera.
>>>
>>> It's a great compromise for when you want the latest and greatest data,
>>> but don't want to have to wait until all of the files are closed to get
>>> it.
>>>
>>> Casey
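
For anyone curious, below is a minimal sketch of the flush-and-rotate
pattern Casey describes, using the plain HDFS API. This is not their
production code: it writes newline-delimited byte records for simplicity
(Casey's team uses Avro), assumes Hadoop 2.x's hflush() (older releases
exposed sync() instead), and the class name, rotation threshold and file
naming are all made up.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RollingHdfsWriter {
        // Arbitrary rotation threshold for illustration (~128 MB).
        private static final long ROLL_SIZE = 128L * 1024 * 1024;

        private final FileSystem fs;
        private final Path dir;
        private FSDataOutputStream out;
        private long bytesWritten;

        public RollingHdfsWriter(Configuration conf, Path dir) throws IOException {
            this.fs = FileSystem.get(conf);
            this.dir = dir;
            roll();
        }

        public synchronized void append(byte[] record) throws IOException {
            out.write(record);
            out.write('\n');
            // Flush after every record so readers (e.g. a MapReduce job)
            // can see the data without waiting for the file to be closed.
            out.hflush();
            bytesWritten += record.length + 1;
            if (bytesWritten >= ROLL_SIZE) {
                roll();
            }
        }

        private void roll() throws IOException {
            if (out != null) {
                out.close();
            }
            // Start a new file in the target directory, named by timestamp.
            out = fs.create(new Path(dir, "data." + System.currentTimeMillis()), false);
            bytesWritten = 0;
        }

        public synchronized void close() throws IOException {
            out.close();
        }
    }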
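
And here is roughly what the OutputFormat tweak mentioned at the top of
this mail can look like. To be clear, this is not code from the
kafka-hadoop-consumer repo, just an illustration of the general technique
against the new mapreduce API: skip the "output directory already exists"
check and give each run's part files unique names. The class name and the
naming scheme are made up.

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class AppendingTextOutputFormat<K, V> extends TextOutputFormat<K, V> {

        @Override
        public void checkOutputSpecs(JobContext context) throws IOException {
            // FileOutputFormat normally fails the job if the output directory
            // already exists. Skipping that check lets each run drop new
            // files into the same directory. (On a secure cluster you would
            // also want the delegation-token setup the parent normally does.)
            Path outDir = getOutputPath(context);
            if (outDir == null) {
                throw new IOException("Output directory not set.");
            }
        }

        @Override
        public Path getDefaultWorkFile(TaskAttemptContext context, String extension)
                throws IOException {
            // Give every run a unique file name so earlier output is never
            // overwritten (timestamp-based naming is just one option).
            FileOutputCommitter committer = (FileOutputCommitter) getOutputCommitter(context);
            String name = "part-" + System.currentTimeMillis() + "-"
                    + context.getTaskAttemptID().getTaskID().getId() + extension;
            return new Path(committer.getWorkPath(), name);
        }
    }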
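
Finally, if anyone goes with Felix's option 2 (compacting many small files
into one bigger file), FileUtil.copyMerge is one quick way to do it without
writing a full MapReduce job. Note that it does a raw byte-level
concatenation, so it only suits line-oriented text data, not Avro or
sequence files; the paths below are made up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class CompactSmallFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // Merge all part files from one run into a single daily file,
            // deleting the small source files afterwards.
            FileUtil.copyMerge(fs, new Path("/kafka/etl/2012-07-04"),
                               fs, new Path("/kafka/compacted/2012-07-04.data"),
                               true, conf, null);
        }
    }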